> Retries are tricky because depending on how the job is implemented they can ca...

> Retries are tricky because depending on how the job is implemented they can cause more harm than good.

Ah, yes, there's nothing like bringing capacity back online only to have it crushed by all your customers retrying at the same time.

AWS got bit hard by this[3] but there's a blog post[1] about it, which is linked to by the docs for their client software[2].

[1] https://aws.amazon.com/blogs/architecture/exponential-backof...

[2] https://docs.aws.amazon.com/general/latest/gr/api-retries.ht...

[3] https://aws.amazon.com/message/5467D2/ ... basically DynamoDB is a fundamental service for AWS and had implemented some new streams features. This all appeared to be working, but they were running closer to capacity than intended, and when a cluster went out this caused a cascading failure.

And see this which linked to that RCA: https://blog.scalyr.com/2015/09/irreversible-failures-lesson...