
Great write-up. The perspective that, hey, maybe distributed consensus isn’t worth the overhead is a nice one to see.

If fly.io found the upper bound on a Consul scale-out, what do you think a reasonable threshold looks like for a smaller system?




(HashiCorp Co-Founder)

First, I love this blog post and I tweeted about it, so please don’t interpret any of my feedback here negatively.

> If fly.io found the upper bound on a Consul scale-out, what do you think a reasonable threshold looks like for a smaller system?

I don’t know the exact scale of Fly.io’s Consul usage, but I would imagine they’re far, far from the upper bound of Consul scale-out. We have documented some exact scale numbers here[1]. And of course these are only the customers we can talk about.

I didn’t read this post as talking about scale limits. Instead, it discusses the tradeoffs of certain Consul deployment patterns, and considers whether their particular usage of Consul is the best way to use a system with the properties that Consul has. And this is a really important question that everyone should be asking about all the software they use! I appreciate Fly sharing their approach.

To answer your second part (what is a reasonable threshold), we have documented recommended hardware requirements and scale limits here: https://learn.hashicorp.com/tutorials/consul/reference-archi... This is the same information we share with our paying customers.

[1]: https://www.hashicorp.com/case-studies


I don't think we're anywhere close to the limit of Consul's ability to scale, but I think we're abusing Consul's API. If I had to pinpoint an "original sin" of how we use the Hashistack here, it's that we need individual per-instance metadata for literally all our services, everywhere. I can't imagine that's how normal teams use Consul.
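For the curious, the pattern looks roughly like this: a minimal sketch using the official Go client (github.com/hashicorp/consul/api), where the service name, instance ID, and metadata keys are made up for illustration:

    // Every instance registers under a shared service name but with its
    // own ID and Meta map, so lookups return per-instance details.
    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        reg := &api.AgentServiceRegistration{
            ID:   "app-1234-instance-5678", // hypothetical per-instance ID
            Name: "app-1234",               // hypothetical service name
            Port: 8080,
            Meta: map[string]string{ // hypothetical per-instance metadata
                "region":  "ord",
                "version": "v42",
            },
        }
        if err := client.Agent().ServiceRegister(reg); err != nil {
            log.Fatal(err)
        }
    }

Multiply that by every instance of every app, everywhere, with every change visible everywhere, and it's the write and watch volume that hurts long before raw cluster size does.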


This is true. We did not have scaling issues with Consul.

It scaled very well from "nugget of an idea" to "whoah maybe this is going to be a big company".

I think the best software buys time for the users. Consul bought us years.




I worked at a place that ran federated Consul in 60 DCs across ~40,000 machines running consul-agent. Originally, the largest DCs had about 8,000 nodes, which caused some problems that we had to work through. But I'm of the opinion that you shouldn't have 8,000 of anything in a single "datacenter" without some kind of partition to reduce blast radius.


How many people were dedicated to keeping that configuration running?


> To answer your second part (what is a reasonable threshold), we have documented recommended hardware requirements and scale limits here:

That's (really) good documentation, but it doesn't directly address the Fly.io situation or mine: multiple data centers in multiple jurisdictions around the globe.

> https://learn.hashicorp.com/tutorials/consul/federation-goss...

> To start with, we have a single Consul namespace, and a single global Consul cluster. This seems nuts. You can federate Consul. But every Fly.io data center needs details for every app running on the planet! Federating costs us the global KV store. We can engineer around that, but then we might as well not use Consul at all.

I think a better way to ask my question might be: Is there a threshold below which we can safely run Consul in a single global cluster, like Fly.io did before it got too unwieldy?


Speaking purely as an ignorant outsider here: I've now seen Roblox and Fly.io have either crippling outages or an inability to scale due to issues in Consul. It's not a good look.


Do you also blame guns when people shoot themselves in the foot because they kept them loaded with the safety off?


Fly.io contributed a bug fix back to Consul, and Roblox’s 3-day outage was due to flaws in Consul streaming.


Consul has been great for us. It handled an unreasonable problem we threw at it far better than we had any right to expect.


A 3-day outage is never a product bug. It's a failure to plan for DR, among many other things.


Agreeing with mitchellh's sibling comment. This post doesn't really seem to discuss a scaling problem (there's an aside about an n-squared messaging issue, but it was fixed, so no big deal). The point of the discussion seems to be that consensus isn't really needed for their use case, and it makes things complicated, so maybe get rid of it.

Really, a node doesn't want consensus on what services are available; it wants to know what services are available to it. Hopefully that's the same globally, but waiting for global consensus means sending more traffic to the wrong place than if you had full-mesh announcements (probably not great at scale) or announcements sent to a designated node or nodes in each datacenter (from there, the information could be gossiped to each node, or looked up on demand from the monitor node). Of course, that works only if you're OK assuming that a service reachable by one node in a datacenter is reachable by all nodes in that datacenter.
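To make the designated-node idea concrete, here's a rough Go sketch (all names hypothetical): a per-datacenter monitor that accepts announcements and answers lookups with last-write-wins state and a TTL, no consensus involved:

    package monitor

    import (
        "sync"
        "time"
    )

    // Monitor is a per-datacenter registry. Services announce
    // themselves to it; nodes query it on demand.
    type Monitor struct {
        mu   sync.Mutex
        seen map[string]map[string]time.Time // service -> addr -> last announce
        ttl  time.Duration
    }

    func NewMonitor(ttl time.Duration) *Monitor {
        return &Monitor{seen: make(map[string]map[string]time.Time), ttl: ttl}
    }

    // Announce records that addr is (still) serving service.
    func (m *Monitor) Announce(service, addr string) {
        m.mu.Lock()
        defer m.mu.Unlock()
        if m.seen[service] == nil {
            m.seen[service] = make(map[string]time.Time)
        }
        m.seen[service][addr] = time.Now()
    }

    // Lookup returns addresses announced within the TTL window.
    // Stale entries simply age out; nothing has to agree globally.
    func (m *Monitor) Lookup(service string) []string {
        m.mu.Lock()
        defer m.mu.Unlock()
        var addrs []string
        for addr, t := range m.seen[service] {
            if time.Since(t) < m.ttl {
                addrs = append(addrs, addr)
            }
        }
        return addrs
    }

The tradeoff is that the view is only eventually right: a dead instance keeps being advertised until its TTL lapses, which is exactly the staleness you accept when you give up consensus.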

Anyway, you may still need to deal with requests that were sent to the right place but received by the wrong place, because the right place changed outside the observation window for the routing decision. (This is also touched on in the article.)
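One way to handle that (roughly what the article touches on, though the details here are assumptions) is to forward the request one hop to wherever the local view now says the service lives, instead of failing it. A sketch, with a stubbed-out lookup and a hypothetical routing header:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    var selfAddr = "10.0.0.1:8080" // hypothetical address of this node

    // owner reports where the service currently lives according to this
    // node's (possibly stale) local view; stubbed out here.
    func owner(service string) (addr string, ok bool) {
        return "", false
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        svc := r.Header.Get("X-Service") // hypothetical routing header
        if addr, ok := owner(svc); ok && addr != selfAddr {
            // Right request, wrong place: proxy one hop to the current
            // owner. (A real version needs a hop-count guard for loops.)
            proxy := httputil.NewSingleHostReverseProxy(&url.URL{
                Scheme: "http",
                Host:   addr,
            })
            proxy.ServeHTTP(w, r)
            return
        }
        w.Write([]byte("served locally\n"))
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }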

Distributed systems are fun and exciting! And there was another post today observing that even a single computer is really a distributed system (and has been for quite a while).


Fly.io reported that they couldn't scale their deployment of Consul (a single global cluster) beyond a (tremendous) point: it became unwieldy and they needed to work around it.

When I referred to a 'scaling problem', I meant scaling that specific Fly.io architecture: the bad one, the one without federation. I'm interested in the safe operating limits for that configuration precisely because Fly.io seemed to make a really good go of it for a long time.


> The point of the discussion seems to be that consensus isn't really needed for their use case, and it makes things complicated, so maybe get rid of it.

I mean, it isn't an NY Times-style lede buried ten feet deep in a wall of text. It's right in the title: A foolish consistency... (:



