Misconfigured Circuit Breakers (shopify.com)
178 points by rbanffy on March 11, 2020 | 39 comments



This might be my bias as someone who mostly writes in actor-based concurrent languages, but: is there a reason to test the service’s back-to-life-ness by passing a real request through, rather than just having every open circuit trigger the creation of a periodic background “/health”-endpoint (or equivalent) poller actor for that backend within your service? Doing the check within the client request lifecycle seems like it would needlessly increase client-request 99th-percentile latency, for no real benefit beyond the single request that doesn’t time out, closes the circuit, and so actually gets served.

ETA: you could even—presuming your system does solely idempotent stuff—take the exact request whose timeout caused the circuit to open, and pass that request (the particular query for SQL; the particular URI + POST/PUT data for REST) to the poller-actor you’re spawning to use as its health-check. Once that request doesn’t time out any more, you know you’re back online.

I’d recommend a real separate /health endpoint for the service you’re trying to contact, though, because 1. most applications do non-idempotent things, and 2. some queries time out because they touch edge-case bad/unscalable code paths, not because the remote-service-as-a-whole is down. If your `SELECT * FROM expensive_generated_report_view` SQL query times out, you really should not be reusing that query as a health-check against your RDBMS. (But nor should you just be doing `SELECT 1`!)
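
A minimal sketch of the poller idea, in Python for concreteness (the class, endpoint URL, and intervals are all made up, not anything from the article):

  import threading
  import time
  import urllib.request

  class Circuit:
      def __init__(self, health_url, poll_interval=5.0):
          self.health_url = health_url
          self.poll_interval = poll_interval
          self.open = False
          self._lock = threading.Lock()

      def trip(self):
          # Open the circuit and spawn the out-of-band health poller.
          with self._lock:
              if self.open:
                  return
              self.open = True
          threading.Thread(target=self._poll, daemon=True).start()

      def _poll(self):
          # Runs outside any client request, so it adds no request latency.
          while True:
              time.sleep(self.poll_interval)
              try:
                  with urllib.request.urlopen(self.health_url, timeout=1) as resp:
                      if resp.status == 200:
                          with self._lock:
                              self.open = False  # back to life: close the circuit
                          return
              except OSError:
                  continue  # still down; keep polling in the background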


A /health endpoint or similar mock might be available and responding fast, while a real request might fail or time out.


Well, that’s not a “true” /health endpoint, then. A service’s /health endpoint should run through its regular non-trivial code paths, and its success should depend on all of the services that normal requests to it depend on. (Probably you’ll need to write it yourself, rather than using one supplied by your application framework.)

For example, if you have a CRUD app fronting a database, your CRUD app’s /health endpoint should attempt to make a simple database query. If you have an ETL daemon that pulls from a third-party API and pushes to a message queue, it should probe the readiness of both the API and the message queue before reporting its own readiness to work. (Of course, it is exactly when the service has its own circuit-breaking logic, with back-up paths “around” these dependencies, that it gets to say it’s healthy even when its dependencies aren’t.)
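
A rough sketch of what such a handler could look like for the CRUD case (the Flask wiring and the db_pool object are my own inventions, not the article's code):

  from flask import Flask

  app = Flask(__name__)

  @app.route("/health")
  def health():
      try:
          # Exercise a real code path: a cheap query against a real table,
          # not just `SELECT 1`.
          with db_pool.connection() as conn:  # hypothetical connection pool
              conn.execute("SELECT id FROM users LIMIT 1")
          return {"status": "ok"}, 200
      except Exception:
          # Report unhealthy and let the load balancer drain traffic.
          return {"status": "unhealthy"}, 503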

A test of the /health endpoint is, after all, what determines whether a service is considered viable for routing by load-balancers; and, vice versa, whether the service is considered “stuck” and in need of a restart by platform auto-scaling logic. As such, your /health endpoints really should be biased toward false positives (reporting unhealthy when they’re actually healthy) rather than false negatives (reporting healthy when they’re actually unhealthy).

If you’ve got a pool of instances, better to have them paranoid of their own readiness in a way where your system will be constantly draining traffic from + restarting them, than to have them lackadaisical about their readiness in a way where they’re receiving traffic they can’t handle.


> Well, that’s not a “true” /health endpoint, then

You cannot make such a "true" health endpoint; it's super easy to make a service that contains a paradox about what such an endpoint should do.

Take one service with two endpoints, A and B. A relies on an external service and the database; B relies only on the database. What should the health endpoint do if the external service is down, taking A down with it? Either the health endpoint is useless, because A is down yet it reports the service as fine, or you've needlessly cascaded the downtime to B. The same problem arises with a single endpoint whose dependencies vary by request branch, and so on.

Of course you can use a health endpoint for determining restarts, load balancers, etc., but it's not a replacement for circuit breakers on your calls.


> you cannot make such a "true" health endpoint

Well, you can make such an endpoint, you already have. It's called...

Your endpoint.

The answer to the top level question is, "because it's easier, more accurate, and more maintainable to call a real endpoint than to try to maintain an endpoint whose sole purpose is to predict whether your other endpoints are actually working."

Aka: Just Ask.


> Take one service with two endpoints, A and B. A relies on an external service and the database; B relies only on the database. What should the health endpoint do if the external service is down, taking A down with it?

Make one /health/a endpoint and one /health/b endpoint. Client-service A uses /health/a to check if the service is "healthy in terms of A's ability to use it." Client-service B likewise pings /health/b.
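
Roughly, as a sketch (check_database and check_external_api are hypothetical probes, and the Flask wiring is just for illustration):

  from flask import Flask

  app = Flask(__name__)

  @app.route("/health/a")
  def health_a():
      # A's usability depends on both the database and the external service.
      ok = check_database() and check_external_api()  # hypothetical probes
      return ({"status": "ok"}, 200) if ok else ({"status": "unhealthy"}, 503)

  @app.route("/health/b")
  def health_b():
      # B only needs the database, so an external-service outage doesn't mark it unhealthy.
      ok = check_database()
      return ({"status": "ok"}, 200) if ok else ({"status": "unhealthy"}, 503)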

In a scenario with many different dependent services (a crawler that can hit arbitrary third-party sites, say, or something like Hadoop that can load arbitrary not-necessarily-existent extensions per job) these endpoints should be something clients can create/register themselves, a la SQL stored procedures; or the service can offer a connection-oriented health-state-change streaming endpoint, where the client holds open a connection and is told about readiness-state-change events.

But to be clear, these are edge-case considerations: in most cases, a service has only critical-path dependencies (which it needs to bootstrap itself, or to “do its job” in the most general SLA sense) and optional dependencies (which it doesn't strictly need, and around which it can circuit-break to offer degraded functionality when they're unavailable).

It's a rare—and IMHO not-well-factored—service that has dependencies that are on the critical path for some use-cases but not others. Such a service should probably be split into two or more services: a core service that all use-cases depend on; and then one or more services that each just do the things unique to a particular use-case, with all their dependencies being on the critical path to achieve their functionality of serving that specific use-case. Then, those use-case-specific services can be healthy or unhealthy.

An example of doing this factoring right: CouchDB. It has a core "database" service, but also a higher-level "view-document querying" service, that can go down if its critical-path dependency (a connection to a Javascript sandbox process) isn't met. Both "services" are contained in one binary and one process, but are presented as two separate collections of endpoints, each with their own health endpoint.

An example of doing this factoring wrong: Wordpress. It's got image thumbnailing! Comment spam filtering! CMS publication! All in one! And yet it's either monolithically "healthy" or "unhealthy"; "ready" or "not ready" to run. That is clearly ridiculous, right?


>Make one /health/a endpoint and one /health/b endpoint. Client-service A uses /health/a to check if the service is "healthy in terms of A's ability to use it." Client-service B likewise pings /health/b.

I've done exactly this, and it worked well in my case, where the number of related services was pretty small. Each endpoint would return an HTTP status code indicating overall health, with additional details stating exactly which checks succeeded or not.


> And yet it's either monolithically "healthy" or "unhealthy"; "ready" or "not ready" to run. That is clearly ridiculous, right?

That's my point. If your editor cannot create new articles, that's a problem that needs to be resolved.

I probably wouldn't want all my editors smashing the submit button until it worked, leading to additional overload problems. So I open the "circuit breaker" for submission by sending an email to all the editors to please stop submitting while wordpress is broken.

But shutting down the complete website would make the problem much worse because readers wouldn't see anything.

Conclusion: It's neither 100% healthy nor 100% unhealthy. Healthiness cannot be captured by a single boolean.


Sometimes it's not a good idea for the health of a service to be determined by its connected parts (e.g., databases). For purely situational awareness this is fine. But if you use the health check to determine whether an instance of an application should be taken out of service, you risk cascading failures, turning 1 problem into 10. It's usually better for the application to throw an error if it can't connect to the database. That said, I do both approaches depending on the situation.


Two failure modes to consider here: the Request of Death, and cascading overload. If a request kills a particular server, you should let the error flow upstream rather than retry it elsewhere; otherwise it will just bounce from server to server until it has killed all of them.

For the latter, someone related a real-world example of this to me the other day. Say you have a bunch of people managing customers. Every employee has 4 customers, and those take up all of their time.

You get a new customer. Instead of hiring a new rep, you give someone a 5th customer to manage. They struggle, and eventually they quit. Now, all of your employees have 5 customers. Sooner or later one of those will also quit, and then it's a race to see who can get out the door fastest.

The moral of that story is that all the load balancing in the world is for naught if you haven't done your capacity planning properly. And once the system starts to buckle it may be too late to bring new capacity online (since startup usually consumes more resources).


I mean... that's what circuit breakers are for. If a dependency of a service is optional to its operation, then the service wraps calls to that dependency in a circuit breaker and fails only the requests that need it. And if a dependency is not optional to the operation of the service, then the failure should cascade to the service's dependent clients, and their dependent clients, and so on, so that there's backpressure all the way back to the origin of the request.
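
As a sketch of the distinction (every name here is hypothetical, and the breaker API is just for illustration):

  def handle_product_page(product_id):
      # Critical dependency: no breaker, let the failure propagate upstream
      # so backpressure reaches the origin of the request.
      product = load_product(product_id)

      # Optional dependency: wrapped in a circuit breaker, degrade on failure.
      recommendations = []
      if recommendations_breaker.allow_request():
          try:
              recommendations = fetch_recommendations(product_id)
              recommendations_breaker.record(success=True)
          except TimeoutError:
              recommendations_breaker.record(success=False)

      return render(product, recommendations)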


Health checks can have several purposes. They're used by the routing control plane to determine inclusion in the load balancer pool. This is already a kind of circuit breaker and is similar to what an application-level circuit breaker would poll. But they're also used by the scheduler to determine whether the instance needs to be restarted. You don't want your thing in a restart loop just because a dependency is down! In fact, if a very widely shared dependency went down, and everyone was checking it in their health check, the scheduler control plane could quickly have a backlog measured in days trying to move all those instances.

Our environment now supports distinct answers to those two questions but most service authors don't know about it.
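
For instance, the two questions can be split into two endpoints; a sketch, where check_database is a hypothetical probe and the Flask wiring is just for illustration:

  from flask import Flask

  app = Flask(__name__)

  @app.route("/live")
  def live():
      # Process-internal checks only: the scheduler restarts on this, so a
      # down dependency should never fail it.
      return {"status": "alive"}, 200

  @app.route("/ready")
  def ready():
      # Dependency checks: failing here drains traffic from this instance
      # without putting it into a restart loop.
      if check_database():
          return {"status": "ready"}, 200
      return {"status": "not ready"}, 503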


So some of the responses to you seem to be missing the point. I agree with you, and with a health check being the wrong thing to circuit break on.

No matter how complex a health endpoint is, the circuit breaker is more so. The health endpoint might make a legit request, checking all downstream services, etc., to confirm that everything is working... but the actual traffic is many different requests, with different code flows.

As an example, imagine service B's health check actually makes a request. Great. Only, service B is misconfigured to talk to staging, whereas the upstream, service A, is in prod. The health check may always pass; after all, the staging request's data all gets resolved and handled appropriately. The upstream one, however, leads to constant failure, as that data doesn't exist.

Health check passes 100% of the time. Actual upstream calls pass 0% of the time. Don't have your upstream calls circuit break on health checks; have them circuit break on actual responses.

(And yes, you should have network isolation between staging and prod so that particular example can't happen. It is intentionally a contrived example)


My original point was that a health-check is a good preflight check before the circuit-breaker goes "half open" in the article's terminology, letting traffic flow through again as a test to see whether the service has recovered. The circuit breaker should still do that—but it should do that after a health-check tells it the service has recovered. That way, you don't get random timeouts from your service abusing user requests as an expensive way of discovering the trivial fact that it can't even connect to the dependency.

Instead, the circuit breaker will first get an upper-bound on healthiness from the /health endpoint—and that'll be enough to tell it when even the upper bound says the service is unhealthy. Then, when the upper bound says "healthy", it'll seek a tighter bound on healthiness by letting through actual requests in the "half-open" state.

In this way, depending on how you've written your /health endpoint, you can ensure that far fewer users get their request abused as the scope for the service's health-checking reconnaissance-in-force mission; and so decrease your 99th-percentile latency. (Since, of course, a background poller hitting the /health endpoint is outside the request flow, and so doesn't count toward 99th-percentile latency.)
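
As a sketch of that two-stage recovery (the states and probe callable are made up; this isn't the article's implementation):

  import enum

  class State(enum.Enum):
      CLOSED = 1
      OPEN = 2
      HALF_OPEN = 3

  class Breaker:
      def __init__(self, health_probe):
          self.state = State.CLOSED
          self.health_probe = health_probe  # callable returning True if /health passes

      def on_background_poll(self):
          # Driven by the out-of-band poller, not by a user request.
          if self.state is State.OPEN and self.health_probe():
              self.state = State.HALF_OPEN  # the upper bound now says "maybe healthy"

      def allow_request(self):
          # User requests only become guinea pigs once we're HALF_OPEN.
          return self.state in (State.CLOSED, State.HALF_OPEN)

      def record(self, success):
          if self.state is State.HALF_OPEN:
              # A real request seeks the tighter bound on healthiness.
              self.state = State.CLOSED if success else State.OPEN
          elif self.state is State.CLOSED and not success:
              self.state = State.OPEN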


you don't mock it, you implement a real one that acts as a remote microservice. Sometimes it's a /health route on the same site, sometimes it's a status.example.com, etc. The only thing it should do is give you a static errors-or-not file that gets regenerated at some reasonable interval by an independently running process, off of all known dependencies that you control.

If the health route says everything's up and running, then you can try a normal request, and if that fails, you back off anew (and you log an error entry that can be used to update the health endpoint's data generator, because something, somewhere, is misreporting)
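
A sketch of that independently running generator (the path, interval, and probe functions are all invented for illustration):

  import json
  import time

  def regenerate_status(path="/var/www/status.json", interval=30):
      # Runs as its own process; the web tier just serves the latest file
      # from the /health route or status site.
      while True:
          status = {
              "database": check_database(),            # hypothetical probes of
              "message_queue": check_message_queue(),  # dependencies you control
              "generated_at": time.time(),
          }
          with open(path, "w") as f:
              json.dump(status, f)
          time.sleep(interval)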


The only way a health endpoint isn't mocking anything is when it makes a large variety of full, valid requests, with the same complexity and functionality as the service itself. Oh, and all of them should fail or succeed identically. At that point, why not use the real request?

Or you're overshooting. Say you have a service with 2 endpoints; both depend on the DB, and one also depends on an external API. The external API goes down, and you don't have a correct choice if you have one health endpoint for your circuit breakers. If you open the breaker, you needlessly disable requests to the healthy endpoint. If you leave it closed, your circuit breaker isn't working for the broken endpoint.

You can of course use a health check in addition to the normal endpoint in your circuit breaker, but then it's just a fail-faster for the scenarios that the health endpoint covers.


> You can of course use a health check in addition to the normal endpoint in your circuit breaker, but then it's just a fail-faster for the scenarios that the health endpoint covers.

Er, yes, that was the thing I was trying to suggest. You want to fail fast (or, more precisely, keep the system in the "circuit open, degraded functionality code-path" state) so your users don't experience mysterious latency spikes when their request becomes the guinea-pig for the circuit-breaker. That'll still happen, but not as often, if a health check can tell the service that there's not even a chance the dependency is worth checking via a user request.


Consider randomizing error_timeout, to avoid a possible case where all clients retry at the same time.


We use this quite a bit but we actually have a constant timeout with random "jitter" added to it. You almost never want the timeout completely randomized because there's a chance you'll get a zero.
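
Roughly like this (the base and jitter values here are made up):

  import random

  ERROR_TIMEOUT = 30.0  # constant base, in seconds
  JITTER = 5.0          # random spread added on top

  def next_error_timeout():
      # Never fully random, so it can't come out near zero.
      return ERROR_TIMEOUT + random.uniform(0, JITTER)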


The concept of “completely randomizing” something isn't really well-defined; it's more of a casual phrase that people use without thinking about it. If you have a constant timeout with random jitter added to it, you've randomized it. It is now random.

If you think about it, a “perfectly random” die roll on an ordinary die will never be 0. So, whether a value is random has no relationship with whether that value can be 0.


You don't want it random; you want to add a splay, e.g.:

  # add a random value (from 0 to X) to the timeout
  # This can be used to prevent the thundering herd problem
  splay = "5s"


After reading your comment, I did consider implementing that in my codebase, and then I remembered that I have so few servers that it would not matter. Even if my company has to scale up to 99% of the market.


I suggest you do it anyway. It's a one-line code or config change, yet if you do suffer the "stampeding herd" problem, it's hard to diagnose and hard to fix in the middle of another production outage.


("thundering" herd in case anyone wants to google it)


The author seems to think completely in terms of "wasted utilization" when it comes to timeouts. I think they are missing the point of the timeouts and the retry logic to begin with. The effort by the circuit breaker isn't wasted, because it is exactly trying to establish whether a resource is responding or not taking into account occasional network hiccups. If every effort past the initial timeouts was wasted, then why implement this logic to begin with? I agree with derefr (https://news.ycombinator.com/item?id=22546241) in the sense that it seems illogical to increase latency for users simply to check for availability of a timed-out resource.

IMHO the worst-case assumption of all service instances failing simultaneously leads the author astray in their quest to reduce "wasted utilization".

Pretend the network switch rebooted and all services were unavailable for a short period of time, but your website is in high demand, so the error threshold of three errors per resource was quickly reached. Let's say the network switch needed 5 seconds to reboot, so 42 resources each failing 3 times in that window equals 126 requests per 5 seconds, or 25.2 requests/second. Now, instead of quickly recovering from that state after two seconds, the author advises waiting 30 seconds, so that's 756 requests (because your site is so popular) before the first service is retried, then an additional 41 requests (~1.6 seconds) until all resources are marked available again.

So now you've made about a thousand people unhappy, in case it's their browsing session that's constantly lost. Unless of course you were too optimistic when setting the half_open_resource_timeout, because then your services might be blocked for multiples of error_timeout, e.g. minutes with a high error_timeout value of 30 seconds. That's a lot more than a thousand people unable to log in.
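
Spelling out that back-of-envelope math (all inputs are the assumptions above, not measurements):

  resources = 42
  errors_per_resource = 3
  outage_seconds = 5

  failed = resources * errors_per_resource   # 126 failed requests during the outage
  rate = failed / outage_seconds             # 25.2 requests/second hitting open circuits

  error_timeout = 30                         # seconds, as the article suggests
  before_first_retry = rate * error_timeout  # 756 requests before the first resource is retried
  tail_seconds = (resources - 1) / rate      # ~1.6 s more until every resource is retried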

IMHO setting the half_open_resource_timeout way lower than the regular service_timeout value will just risk the services _never_ becoming available again after an internal network outage in your data center. That seems like a recipe for disaster.


Good read!

This sounds similar to the work I did [1] on the Datadog Agent, especially regarding the concept of each resource having its own circuit breaker.

My implementation is a bit different though, instead based on exponential decay like BGP Route Flap Damping [2]. It matches our use case a bit better and is easier to reason about.

[1]: https://github.com/DataDog/datadog-agent/pull/1458

[2]: https://tools.ietf.org/html/rfc2439#section-2.3


> Utilization, the percentage of a worker’s maximum available working capacity, increases indefinitely as the request queue builds up, resulting in a utilization graph like this.

Keyword indefinitely. Isn't this assuming the service worker doesn't have a timeout itself?


Even with a timeout, the time spent waiting on the timeout is wasted utilization. Therefore, if the request rate stays the same and each request wastes enough utilization, the utilization required to serve the incoming requests will exceed the rate at which the backlog is worked off, and the queue keeps growing.


This is apparently a software function known as a circuit breaker, and has nothing to do with electrical current flow.

Fuckin' HN headlines, I swear.


It's called an analogy. It functions similarly to an electrical circuit breaker because it disables systems before things go fully haywire/bad, and it helps devs understand the concept by using familiar things.


It behaves a lot more like a GFCI than a circuit breaker, but apparently nobody involved ever watched This Old House, or any other home improvement shows.

I would like these throttling tools to behave in a different way, one that more closely resembles a circuit breaker. But some rat bastard has already used that name so now I don't know what to call it.


It's bad naming practice and only adds confusion to search.


It's a common pattern in software engineering to stop failures jumping from one area of a system to another. I don't get why you're mad about that. It functions analogously to the very thing it's named after.


yeah I got a little excited about how I might fix the circuit breaker in my house.


> how I might fix the circuit breaker in my house.

Don't be so disappointed; here's my tip for you. In electrical installations, every subcircuit should have its own RCD/GFCI circuit breaker installed, rather than relying on a single RCD/GFCI breaker higher up the hierarchy. Otherwise the circuit breakers are misconfigured: failing to do so makes troubleshooting leakage current (read "tripping breakers") extremely difficult.

But when it happens, there's a solution. I once fixed one at my home that kept tripping spontaneously; the secret is using a leakage current clamp meter. An RCD/GFCI circuit breaker trips when the leakage current is greater than its threshold, to protect equipment and people. With a leakage clamp meter, you can monitor the leakage current in real time rather than hopelessly trying in the dark. You clamp the meter across BOTH the live and neutral wires at the main breaker (if both wires carry the same current in opposite directions, i.e. no leakage, the net magnetic flux inside the sensor is zero, so the meter reads 0 mA, and vice versa), turn off every subcircuit, and start switching the subcircuits back on one by one. If the leakage current increases substantially after you switch on a specific subcircuit, repeat the process by disconnecting all appliances from that subcircuit and reconnecting them one by one while watching the meter. Eventually you'll find the smoking gun (if there's still significant leakage when nothing is connected, the wiring itself is at fault); mine was a light fixture with damaged insulation.

A leakage current clamp costs $200-$300; Fluke's meter will cost you $600. However, you can get a good-enough, low-accuracy leakage clamp from China on AliExpress for $50-$100; just make sure the product description mentions "leakage" (the laws of physics are the same for all current clamp meters, but most are not designed to sense sub-10 mA leakage current).

In my experience, only a handful of electricians working with home wirings are aware of this diagnostic technique. I learned it from an application note of an industrial system.


I'm an electrician. I was very confused for the first few paragraphs.


Seeing where the link was pointing to (Shopify), you could have expected it.

On the other hand, the same article from the Philips Engineering Blog... now that would be very confusing!


Now let me tell you about bulkheads...


Better to remain silent and be thought a fool...

This term has been around for quite a while. I don't think it's a very apt term, but it's the one that was chosen, so whateryagonnado.

Welcome to the 10,000.



