Retries – An interactive study of request retry methods (encore.dev)
251 points by whenlambo 9 months ago | 53 comments



This still isn't what I'd call "safe". Retries are amazing at supporting clients in handling temporary issues, but horrible for helping them deal with consistently overloaded servers. While jitter & exponential backoff help with the timing, they don't reduce the overall load sent to the service.

The next step is usually local circuit breakers. The two easiest to implement are terminating the request if the error rate to the service over the last <window> is greater than x%, and terminating the request (or disabling retries) if the % of requests that are retries over the last <window> is greater than x%.

i.e. don't bother sending a request if 70% of requests have errored in the last minute, and don't bother retrying if 50% of the requests we've sent in the last minute have already been retries.

The Google SRE book describes lots of other basic techniques for making retries safe.
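
A minimal sketch of that first heuristic (the class and field names are mine, and the window/threshold values are illustrative, not recommendations):

    // Rolling error-rate breaker: skip the call entirely when the
    // recent error rate crosses a threshold.
    class ErrorRateBreaker {
      private results: { at: number; ok: boolean }[] = [];

      constructor(
        private windowMs = 60_000,   // "the last minute"
        private maxErrorRate = 0.7,  // "70% of requests errored"
      ) {}

      allowRequest(): boolean {
        const now = Date.now();
        this.results = this.results.filter(r => now - r.at < this.windowMs);
        if (this.results.length === 0) return true;
        const errors = this.results.filter(r => !r.ok).length;
        return errors / this.results.length < this.maxErrorRate;
      }

      record(ok: boolean) {
        this.results.push({ at: Date.now(), ok });
      }
    }

The retry-rate variant is the same shape, just tracking which requests were retries instead of which ones errored.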


Finagle fixes this with Retry Budgets: https://finagle.github.io/blog/2016/02/08/retry-budgets/
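
The core idea, very roughly (a sketch of the concept, not Finagle's actual implementation; the ratio and per-second floor are meant to echo the defaults that post describes):

    // Retries are allowed as long as they stay under some fraction of
    // recent regular traffic, plus a small fixed allowance so
    // low-traffic clients can still retry at all.
    class RetryBudget {
      private requestTimes: number[] = [];
      private retryTimes: number[] = [];

      constructor(
        private ratio = 0.2,           // retries per regular request
        private minRetriesPerSec = 10,
        private windowMs = 10_000,
      ) {}

      onRequest() { this.requestTimes.push(Date.now()); }

      canRetry(): boolean {
        const now = Date.now();
        const fresh = (t: number) => now - t < this.windowMs;
        this.requestTimes = this.requestTimes.filter(fresh);
        this.retryTimes = this.retryTimes.filter(fresh);
        const allowed =
          this.requestTimes.length * this.ratio +
          (this.minRetriesPerSec * this.windowMs) / 1_000;
        if (this.retryTimes.length >= allowed) return false;
        this.retryTimes.push(now);
        return true;
      }
    }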


Totally! Thanks for bringing those up. I tried to keep the scope specifically on retries and client-side mitigation. There's a whole bunch of cool stuff to visualise on the server-side, and I'm hoping to get to it in the future.


Your response makes it sound like you think circuit breakers are server-side and not related to retries. They are not; they are a client-side mitigation that is a critical part of a mature retry library.


The client can track its own error rate to the service, but it would need information from a server to get the overall health of the service, which is what the author probably means. Furthermore, the load balancer can add a Retry-After header to have more control over the client's retries.
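
A sketch of what honouring Retry-After can look like on the client (assumes a fetch-style API and treats the header as seconds for brevity, though it can also be an HTTP date):

    async function requestWithRetryAfter(
      url: string,
      attempts = 3,
    ): Promise<Response> {
      for (let i = 0; i < attempts; i++) {
        const res = await fetch(url);
        // Only retry on throttling / temporary unavailability.
        if (res.status !== 429 && res.status !== 503) return res;
        if (i === attempts - 1) break;
        const header = res.headers.get("Retry-After");
        // Prefer the server's hint; otherwise fall back to exponential backoff.
        const delayMs = header ? Number(header) * 1_000 : 1_000 * 2 ** i;
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
      throw new Error("gave up after retries");
    }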


I think I've misunderstood what circuit breakers are for years! I did indeed think they were a server-side mechanism. The original commenter's description of them is great: you can essentially create a heuristic based on the observed behaviour of the server and decide against overwhelming it further if you think it's unhealthy.

TIL! Seems like it can have tricky emergent behaviour. I bet if you implement it wrong you can end up in very weird situations. I should visualise it. :)


I mean, they can and should be both. Local decisions can be cheap, and very simple to implement. But global decisions can be smarter, and more predictable. In my experience, it's incredibly hard to make good decisions in pathological situations locally, as you often don't know you're in a pathological situation with only local data. But local data is often enough to "do less harm" :)


Do you have a newsletter?


Not a newsletter as such but I do have an email list where I post whenever I write something new. You can find it here: https://buttondown.email/samwho


This is one of those things that exposes our industry's maturity relative to other engineering disciplines that have been around longer. You would think by now that the various frameworks for remote calls would have standardized down to include the best practice retry patterns, with standard names, setting ranges, etc. But we mostly still roll our own in most languages/frameworks. And that's full of footguns around DNS caching, when/how to retry on certain failures (unauthorized, for example), and so on.

(Yes, there should also be the non-abstracted direct path for cases where you do want to roll your own).


> You would think by now that the various frameworks for remote calls would have standardized down to include the best practice retry patterns, with standard names, setting ranges, etc.

There is a school of thought that argues that the best retry pattern is no retry at all: just let the client fail and handle that state.

One of the driving arguments is that retries are a lazy way to try to move faults from the client onto the server, and in the process cause more harm (e.g., a self-inflicted DDoS).

Sometimes complex means wrong, and all these retry strategies are getting progressively more complex at the expense of hammering servers with traffic way beyond the volume they were designed to handle. How is that a decent tradeoff?


I disagree. I think the trade-off is very reasonable. At some point you need to retry (even if the trigger is the user manually pressing F5 in the browser/clicking a button again/running a program again). Because they actually have some goal to accomplish.

Some failures really are random, let's say 0.1% of requests fail. For a sufficiently complex backend/operation, one user request can easily generate 100 internal requests that can fail. If you don't retry, this adds up to a non-negligible chance that a whole user-facing operation fails and all 100 requests have to be retried - you actually increased the number of requests that had to be made! As an extreme example, imagine that during ChatGPT training one request failed and the whole training run had to be started from scratch because we don't do retries.
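
Rough arithmetic for that example: with a 0.1% failure rate, the chance that at least one of 100 internal requests fails is 1 - 0.999^100 ≈ 9.5%, so without per-request retries roughly one in ten user-facing operations would fail and have to be redone in full.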


> I disagree. I think the trade-off is very reasonable. At some point you need to retry (even if the trigger is the user manually pressing F5 in the browser/clicking a button again/running a program again). Because they actually have some goal to accomplish.

I don't think your belief holds water if you think about your example. The goal of a retry, from a client standpoint, is to introduce an acceptable delay in order to pretend the original request was successful. This strategy is only valid if the retries don't penalize perceived performance or the normal operational state of the service. Consequently, all retry strategies involve sending multiple requests per second. The link to Retry Budgets posted in this discussion explicitly mentions "a minimum of 10 retries per second."

A user pressing F5 will never generate this volume of requests.

> Some failures really are random, let's say 0.1% of requests fail.

That's why failing fast and not retrying is the best strategy for most if not all applications. Retry strategies introduce high levels of complexity to handle something that only rarely happens, and in the rare case that it does happen it can be trivially fixed by the user triggering a refresh.

If it's an application that already outputs a high volume of requests, then once the first request fails it will simply send another request as part of its happy path.

Some developers like retries because they use them to patch broken code paths and pretend that they don't have to deal with scenarios where the network is not 100% reliable. They onboard a retry library, they update their requests to transparently appear as a single request, and they proceed as if their application doesn't have a failure mode. Except it does, but now they have also decided to trade their wishful thinking for a higher risk of causing a cascading DDoS attack on their own infrastructure.


> That's why failing fast and not retrying is the best strategy for most if not all applications.

I think it's more complex than this. You also have to lump timeouts, caching and failure behavior into the conversation. And there are also situations where you absolutely need some amount of retries. Say, for example, you want seamless failover between backends...you're expecting some failures and don't want or need to expose those to your end users. Or, maybe the "end user" isn't a person. Like, for example, finalizing a financial transaction from a queue.


Summary of the article: use exponential backoff + jitter for retry intervals.

What the author didn't mention: sometimes you want to add jitter to delay the first request too, if the request happens immediately after some event from the server (like the server waking up). If you don't do this, you may crash the server, and if your exponential backoff counter is not global you can even put the server into a cyclic restart.
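
A sketch of the idea (the delay values are arbitrary):

    const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

    // Jitter the *first* attempt too, so clients that all woke up on
    // the same event (server restart, reconnect broadcast) don't fire
    // in lockstep.
    async function connectWithInitialJitter(connect: () => Promise<void>) {
      await sleep(Math.random() * 5_000); // spread out the first attempt

      let backoffMs = 1_000;
      for (let attempt = 0; attempt < 10; attempt++) {
        try {
          return await connect();
        } catch {
          await sleep(Math.random() * backoffMs);      // full jitter
          backoffMs = Math.min(backoffMs * 2, 30_000); // capped, per-client
        }
      }
      throw new Error("could not connect");
    }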


If you can crash the server with an improperly timed request, then you have a much bigger problem than client-side stuff.


I think what they mean is something that would cause clients to do something at the same time (could be all sorts of things: a synchronised crash, timers aligned to clock-time, etc.). If the requests aren't user-driven then yes, you likely would want to include some jitter in the first request too.

Funnily, you'll notice that some of the visualisations have the clients staggering their first request. It's exactly for this reason. I wanted the visualisations to be as deterministic as possible while still feeling somewhat realistic. This staggering was a bit of a compromise.

Not sure what is meant by "if your exponential backoff counter is not global", though. Would love to know more about that.


True, but you can imagine something like a websocket to all clients getting reset and everyone re-connecting, re-authenticating, and getting a new payload.


One example: if a datacenter loses power and then all the hosts get turned on at the same time, they can all send requests at the same time and crash a server.


Yes. Worst that should happen is getting a 404 or something. A crash due to requesting a piece of data that has not yet been created is poor design.


Yup, classic Thundering Herd Problem


Really nice animations. I especially liked the demonstration of the effect where, after some servers "explode", any server that gets restarted is automatically DoS'ed until we throw a bunch of extra temporary servers into the system. Thanks.


Yeah! An insidious problem that’s not obvious when you’re picking a retry interval.

I had fun with the details of the explosion animation. When it explodes, the number of requests that come out is the actual number of in-progress requests.


The animations are so cool!!!

In general the phenomenon is known as _metastable failure_, which can be triggered when there is more work to do during a failure than during a normal run.

With retries, the client does more work within the same amount of time, compared to doing nothing or backing off exponentially.


For a lot of things, retry once and only once (at the outermost layer to avoid multiplicative amplification) is more correct. At a large enough scale, failing twice is often significantly (like 90%+) correlated with the likelihood of failing a third time regardless of backoff / jitter. This means that the second retry only serves to add more load to an already failing service.


Correct. It's also the case that human-generated requests lose their relevance within seconds, so a quick retry is all they're worth. As for machine-generated requests, a dead-letter queue would make more sense: poorly engineered backend services would OOM and well-engineered ones would load-shed, and if the requests are queued on the application servers they're doomed to be lost anyway.


Retrying end-to-end instead of stepwise greatly reduces the reliability of a process with a reasonable number of steps.

That being said, processes should ideally be failing in ways which make it clear whether an error is retryable or not.
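
To put rough numbers on it: with 10 steps that each succeed 99% of the time, the whole process succeeds about 0.99^10 ≈ 90% of the time. Retrying each failed step once pushes each step to 99.99%, so the process succeeds about 99.9% of the time; retrying the whole process once end-to-end only gets you to roughly 99.1%, and it re-does all the work that had already succeeded.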


A must-read (or rather: must-see) for anyone who thinks exponential backoff is overrated.


> A must-read (or rather: must-see) for anyone who thinks exponential backoff is overrated.

I don't think exponential backoff was ever accused of being overrated. Retries in general have been criticized for being counterproductive in multiple respects, including the risk of creating self-inflicted DDoS attacks, and exponential backoff can result in untenable performance and usability problems without adding any upside. These are known problems, but none of them amounts to calling the technique "overrated".


Thanks for sharing!

I’m the author of this post, and happy to answer any questions :)


There's a subtle insight that could be added to the post if you consider it worth it, and it's something that's actually there already, but difficult to realize: Clients in your simulation have an absolute maximum number of retries.

I noticed this mid-read, when looking at one of the animations with 28 clients: they would hammer the server but suddenly go into a wait state, with no apparent reason.

Later in the final animation with debug mode enabled, the reason becomes apparent for those who click on the Controls button:

Retry Strategy > Max Attempts = 10

It makes sense, because in the worst case when everything goes wrong, a client should reach a point where it desists and just aborts with a "service not available" error.


You know, I hadn't actually considered mentioning it. Another commenter brought it up, too. It's so second nature I forgot about it entirely.

I'll look at giving it a nod in the text. Thank you for the feedback. :)


Exponential retries can effectively have a maximum number of requests if the gap between retries gets long enough quickly enough. In practice, the user will refresh or close the page if things look broken for too long.


Oh, please don't do that.

Unbounded exponential backoff is a horrible experience, and improves basically nothing.

If it makes sense to completely fail the request, do it before the waiting becomes noticeable. If it's something that can't just fail, set a maximum waiting time and add jitter.
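
Something like this shape, where both the attempts and the wait are bounded (constants are illustrative):

    // Bounded retries with capped, full-jitter exponential backoff.
    async function retry<T>(
      fn: () => Promise<T>,
      { maxAttempts = 5, baseMs = 200, capMs = 5_000 } = {},
    ): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          if (attempt >= maxAttempts) throw err; // fail visibly, don't wait forever
          const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
          await new Promise(r => setTimeout(r, Math.random() * ceiling));
        }
      }
    }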


I think decoupling retry logic from the “there’s something wrong” UI ends up being a better experience than tying the UI state to the details of network retries. (For one thing, it gives you a chance to fix the “everything is broken” UI without any action on the user’s part.)


Thanks for this -- it's really great!

One thing I noticed is that the post is very first-principles right up to where it reaches exponential backoff. At that point, it quickly jumps to "and here's exponential backoff and here's some good parameters". But I've worked on a lot of systems that got those wrong. In both directions: too-short caps that were insufficient for the underlying system to recover and too-long caps so that even when the servers _did_ recover, clients weren't even going to try again for way too long (e.g., 2 days). It'd be neat to have another section or two exploring those tradeoffs.

I really want one of these visual explorations for the idea of margin. Concretely: it's common to have systems at, say, 88% CPU utilization that appear to be working great. Then you ramp them up to like 92% and start seeing latency bubbles of multiple seconds or even tens of seconds. We tend to think of that idle time as waste, but it's essential for surviving transient blips in load. I increasingly feel like this concept is really fundamental and ought to be taught in like high school because it applies so many places (e.g., emergency funds, in the realm of personal finance).


What technology did you use for the animations? I've a bunch of itches I'd like to scratch that would be improved by having some canvas animated explainers or UI, but I never clicked with anything. I used D3 back in the day.

A rudimentary look at the source code showed a <traffic-simulation/> element, but I'm not up to date enough with web standards to know where to look for it in your JS bundle to guess at the framework!


It uses PixiJS (https://pixijs.com/) for the 2D rendering and GSAP3 (https://gsap.com/) for the animation. The <traffic-simulation /> blocks are custom HTML elements (https://developer.mozilla.org/en-US/docs/Web/API/Web_compone...) which I use to encapsulate the logic.
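
The registration itself is just the standard Web Components API. A simplified sketch (not the exact code from the post) looks like:

    // Minimal custom element; the real <traffic-simulation> does its
    // PixiJS/GSAP setup inside connectedCallback.
    class TrafficSimulation extends HTMLElement {
      connectedCallback() {
        const canvas = document.createElement("canvas");
        this.appendChild(canvas);
        // ...initialise the renderer and animations against `canvas`...
      }
    }

    customElements.define("traffic-simulation", TrafficSimulation);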

I've been thinking about creating a separate repo to house the source code of posts I've finished so people can see it. I don't like all the bundling and minification but sadly it serves a very real purpose to the end user experience (faster load speeds on slow connections).

Until then feel free to email me (you'll find my address at the bottom of my site) and I'd be happy to share a zip of this post with you.


I've uploaded the code for all of my visualisation posts here: https://github.com/samwho/visualisations.

Enjoy! :)


Thank you so much for the detailed reply. pixijs looks amazing, and gsap looks pretty approachable!


This is the client side of things. And I think this is a great resource that everyone who writes clients for anything should see.

But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.

Because: I'm not gonna let bad clients break my service.

What that means in practice: Clients are given a choice: They can behave, or they can

    HTTP 429 Too Many Requests


> This is the client side of things.

The article is about making requests, and strategies to implement when the request fails. By definition, these are clients. Was there any ambiguity?

> But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.

I don't think this is the obscure detail you are making it out to be. A few of the most basic and popular retry strategies are designed explicitly to a) handle throttled responses from servers, and b) mitigate the risk of causing self-inflicted DDoS attacks. This article covers a few of those, such as exponential backoff and jitter.


> Was there any ambiguity?

Did I say there was?

> I don't think this is the obscure detail you are making it out to be

Where did I call this detail "obscure"?

My post is meant as a light-hearted, humorous note pointing out one of the many reasons why it is in general a good idea for clients to implement the principles outlined in the article.


Throttling, tarpitting, and circuit-breakers are something I'd love to visualise in future, too. Throttling on its own is such a massive topic!


> Did I say there was?

Yes. You explicitly wrote in your comment "This is the client side of things", as if there was any ambiguity in where requests came from.

> Where did I call this detail "obscure"?

You explicitly wrote that "(...) there is an additional piece of info everyone who writes clients needs to see" on what people like you "who implement backend services, may do if clients ignore such wisdom", as if somehow this was obscure, arcane and secret knowledge that no team whatsoever working on backend services with client teams ever dared share with the outside world.

> (...) why it is in general a good idea for clients to implement the principles outlined in the article.

I don't think there is a single developer out there working on client/server projects who isn't aware of the need to handle request failures and be mindful of service level agreements, especially when dealing with retries.

Your post reads like addressing life guards to let them know that there is an additional piece of info everyone who works as a lifeguard needs to see: that the water is wet.


Remember to limit the exponential backoff interval if you are not limiting the number of retries


I worked at a company with a self-inflicted wound related to retries.

At some point in the distant (internet time) past, a sales engineer, or the equivalent, had written a sample script to demonstrate basic uses of the API. As many of you quickly guessed, customers went on a copy/paste rampage and put this sample script into production.

The script went into a tight loop on failure, naively using a simple library that did not include any back-off or retry in the request. I'm not deeply familiar with how the company dealt with this situation. I am aware there was a complex load balancing system across distributed infrastructure, but also, just a lot of horsepower.

Lesson for anyone offering an API product: don't hand out example code with a self-own, because it will become someone's production code.


I have been thinking about queueing theory lately. I don't have the math abilities to do anything deep with it, but it seems like even basic applications of certain things could prove valuable in real world situations where people are just kind of winging it with resource allocation.


If a picture is worth 1,000 words, then what's a well made animation worth? These are great intuitive representations of your retry methods. Bravo!


Thank you! <3


The pale red failed retries should be more kiki-like; the way they are now, their pointedness is hard to see when they're moving.


Exponential backoff doesn't apply to successful requests, right? The simulation doesn't reflect that, I think. Peace.


It doesn’t apply to successful requests, that’s right.

The simulation retries failed requests using various retry strategies, and then after a successful request will wait a configured amount before sending the next request.



