A Foolish Consistency: Consul at Fly.io (fly.io)
148 points by soheilpro on March 31, 2022 | 69 comments



I have been burned by Consul countless times and advise against actively using it.

Google around and you'll find companies that have had downtime because of Consul:

https://www.protocol.com/newsletters/protocol-enterprise/rob...

https://danveloper.medium.com/on-infrastructure-at-scale-a-c...

I used Consul extensively in 2016-2017 with consul-template.

The problem with Consul is that it's very easy to end up rejecting connections. Accidentally let a node with a version that's too old or too new join, and the entire cluster will go down, which can happen easily in an autoscaling environment.

Another issue: agents take a long time to be cleaned up and removed. In an autoscaling setup where VMs come and go all the time, you end up with many dead nodes in the Consul UI. Again, this was 2017, so it may have changed by now.

I have since replaced Consul with something surprisingly simple, and in retrospect I think that's all I ever wanted.

A local process runs a key-value API on top of a local database (SQLite or anything else, up to your taste). It replicates its state from a central server.

If the central server is down, the local agent still has its state and can continue serving stale data.
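
A rough sketch of that design, just for concreteness (the endpoints, paths, and snapshot format here are all invented): a local agent answers reads from its own SQLite copy and refreshes that copy from a central server in the background, so reads keep working on stale data whenever the central server is unreachable.

    package main

    import (
        "database/sql"
        "encoding/json"
        "io"
        "log"
        "net/http"
        "time"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "/var/lib/agent/state.db")
        if err != nil {
            log.Fatal(err)
        }
        db.Exec(`CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)`)

        // Background sync: pull a full snapshot from the central server.
        // Failures are logged and ignored; the local copy keeps serving.
        go func() {
            for range time.Tick(30 * time.Second) {
                resp, err := http.Get("http://central.internal/snapshot")
                if err != nil {
                    log.Printf("central server unreachable, serving stale data: %v", err)
                    continue
                }
                body, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                var snapshot map[string]string
                if json.Unmarshal(body, &snapshot) != nil {
                    continue
                }
                for k, v := range snapshot {
                    db.Exec(`INSERT INTO kv (k, v) VALUES (?, ?)
                             ON CONFLICT(k) DO UPDATE SET v = excluded.v`, k, v)
                }
            }
        }()

        // Local read API: always answered from SQLite, never from the network.
        http.HandleFunc("/kv/", func(w http.ResponseWriter, r *http.Request) {
            key := r.URL.Path[len("/kv/"):]
            var val string
            if err := db.QueryRow(`SELECT v FROM kv WHERE k = ?`, key).Scan(&val); err != nil {
                http.NotFound(w, r)
                return
            }
            io.WriteString(w, val)
        })
        log.Fatal(http.ListenAndServe("127.0.0.1:8500", nil))
    }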


Can't speak to current experience, but we used it from roughly early 2016 to late 2017 on ~100 VMs on GCP and it performed flawlessly. No autoscaling though.


you autoscale the consensus system?


I take it as the general concept of an ASG, not just scaling. For example, I have all instances in an autoscaling group to enable transparent replacements and make rolling updates easier. (Update the requested image, then just start killing the old ones.)


In the case of the parent comment I think he means literally having them scale up and down. I mostly have experience with ZooKeeper, and I treat those clusters very, very carefully.


Somewhat related question: Consul's deployment guide recommends against deploying a cluster across high latency networks, and suggests federation instead. Did you have any issues caused by the high latency between servers and agents? Are all your Consul servers located in the same region?


They are, and we do have those issues; we address them somewhat towards the end of the post, with São Paulo and Tokyo --- but also with São Paulo and São Paulo: a VM that pops up in GRU has to post to Consul in IAD before other GRU nodes "see" it in Consul. We've addressed that problem by "hinting": sending service discovery updates over NATS that usually beat out Consul, and then reconciling the two event sources.

As I understand it, the reasonable-person thing to do here is to have federated Consul clusters in different regions. We run up against the problem that we're not hosting one application globally, but rather a zillion applications; every node needs to know about every instance of every application. So things like losing the global KV store are problematic for us.


> We run up against the problem that we're not hosting one application globally, but rather a zillion applications; every node needs to know about every instance of every application.

Have you considered nerd-sniping colmmacc? ;)


Something with “sharded consensus” like tikv or cockroach seems like a better fit based on this description


We actually don't have a consensus problem. Each "node" is its own source of truth and every other node can take what it says on faith. We _also_ need all the data in all the places.


The “consensus” in Raft is simply a way to guarantee durability of writes (among other things) in a distributed setup. In your case the producer node gets a guarantee that it did in fact commit its route to the DB. Of course you can achieve a similar outcome with something like gossip + repeated announcements; which approach is more efficient probably depends on the specifics.
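
A tiny sketch of the "repeated announcements" half of that idea, using the NATS Go client (the subject and field names are made up): the producer simply re-broadcasts its state on a timer, so peers that missed a message converge on the next tick instead of relying on a durable, consensus-backed write.

    package main

    import (
        "encoding/json"
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    // Announcement is an invented payload: which node is running which service.
    type Announcement struct {
        Node    string    `json:"node"`
        Service string    `json:"service"`
        Addr    string    `json:"addr"`
        SentAt  time.Time `json:"sent_at"`
    }

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        ann := Announcement{Node: "gru-worker-1", Service: "app-1234", Addr: "10.0.0.5:8080"}

        // Re-announce on a timer: a peer that missed an earlier message
        // converges on the next tick, so the producer never needs a durable,
        // consensus-backed write to be "heard".
        for range time.Tick(10 * time.Second) {
            ann.SentAt = time.Now()
            payload, _ := json.Marshal(ann)
            if err := nc.Publish("services.announce", payload); err != nil {
                log.Printf("announce failed, will retry next tick: %v", err)
            }
        }
    }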


IAD? Are you hosting on another cloud provider or your own colocation?


We run our own servers. IAD is Ashburn, Virginia's airport code and there are a bunch of providers in Ashburn, Virginia.


TBH that's sound advice for pretty much every distributed system with a consensus protocol: high latency = slow consensus = slow change propagation.


Great write up. The perspective that hey maybe distributed consensus isn’t worth the overhead is a nice one to see.

If fly.io found the upper bound on a Consul scale out, what do you think a reasonable threshold looks like for a smaller system?


(HashiCorp Co-Founder)

First, I love this blog post and I tweeted about it, so please don’t interpret any of my feedback here negatively.

> If fly.io found the upper bound on a Consul scale out, what do you think a reasonable threshold looks like for a smaller system?

I don’t know the exact scale of Fly.io’s Consul usage, but I would imagine they’re far, far from the upper bound of Consul scale out. We have documented some exact scale numbers here[1]. And of course these are only the customers we can talk about.

I didn’t read this post as talking about scale limits. Instead, it discusses the tradeoffs of certain Consul deployment patterns, and considers whether their particular usage of Consul is the best way to use a system with the properties that Consul has. And this is a really important question that everyone should be asking about all the software they use! I appreciate Fly sharing their approach.

To answer your second part (what is a reasonable threshold), we have documented recommended hardware requirements and scale limits here: https://learn.hashicorp.com/tutorials/consul/reference-archi... This is the same information we share with our paying customers.

[1]: https://www.hashicorp.com/case-studies


I don't think we're anywhere close to the limit of Consul's ability to scale, but I think we're abusing Consul's API. If I had to pinpoint an "original sin" of how we use the Hashistack here, it's that we need individual per-instance metadata for literally all our services, everywhere. I can't imagine that's how normal teams use Consul.


This is true. We did not have scaling issues with Consul.

It scaled very well from "nugget of an idea" to "whoah maybe this is going to be a big company".

I think the best software buys time for the users. Consul bought us years.


bug company -> big company


I worked at a place that ran federated Consul in 60 DCs across ~40,000 machines running consul-agent. Originally, the largest DCs had about 8,000 nodes, which caused some problems that we had to work through. But I'm of the opinion that you shouldn't have 8,000 of anything in a single "datacenter" without some kind of partition to reduce blast radius.


How many people were dedicated to keeping that configuration running?


> To answer your second part (what is a reasonable threshold), we have documented recommended hardware requirements and scale limits here:

That's (really) good documentation, but doesn't directly address the Fly.io situation nor my situation: multiple data-centers in multiple jurisdictions around the globe.

> https://learn.hashicorp.com/tutorials/consul/federation-goss...

> To start with, we have a single Consul namespace, and a single global Consul cluster. This seems nuts. You can federate Consul. But every Fly.io data center needs details for every app running on the planet! Federating costs us the global KV store. We can engineer around that, but then we might as well not use Consul at all.

I think a better way to ask my question might be: Is there a threshold below which we can safely run Consul in a single global cluster, like Fly.io tried, before it gets too unwieldy?


Purely as an ignorant outsider here, but now I've seen Roblox and Fly.io have either crippling outages or an inability to scale due to issues in Consul. It's not a good look.


Do you also blame guns when people shoot themselves in the foot when they kept it loaded and the safety off?


Fly.io committed a bug fix back to Consul, and Roblox’s 3-day outage was due to flaws in Consul streaming.


Consul has been great for us. It handled an unreasonable problem we threw at it far better than we had any right to expect.


A 3-day outage is never a product bug. It's a failure to plan for DR and many other things.


Agreeing with mitchellh's sibling comment. This post doesn't seem to discuss a scaling problem really (there's an aside about an issue with n squared messaging, but it was fixed, so no big deal). The point of the discussion seems to be that consensus isn't really needed for their use case, and it makes things complicated, so maybe get rid of it.

Really, a node doesn't want a consensus on what services are available, it wants to know what services are available to it. Hopefully that's the same globally, but waiting for global consensus means sending more traffic to the wrong place than if you had full mesh announcements (probably not great at scale) or announcements sent to a designated node(s) in each datacenter (possibly gossiped from there to each node, if desired, could be looked up on demand from the monitor node). Of course, that works only if you're ok assuming that a service reachable by one node in a datacenter is reachable by all nodes in a datacenter.

Anyway, you still may want to deal with requests that were sent to the right place, but are received by the wrong place, because the right place changed beyond the observation window for the routing decision. (This is also touched on in the article).

Distributed systems are fun and exciting! And, there was another post today noticing that even a single computer is really a distributed system (and they have been for quite a while)


Fly.io reported that they couldn't scale their deployment of Consul (single global cluster) beyond a (tremendous) level because it became unwieldy and they needed to work around that.

When I referred to a 'scaling problem', I was referring to scaling that specific Fly.io architecture. The bad one, the one without federation. I'm interested in the safe operating limits for that configuration precisely because Fly.io seemed to make a really good go of it for a long time.


> The point of the discussion seems to be that consensus isn't really needed for their use case, and it makes things complicated, so maybe get rid of it.

I mean, it isn't a NY Times-style lede buried in a wall of text 10 feet deep. It is right in the title: A foolish consistency... (:


"With me so far? Neither am I."

Ok, look, I know CAP and distributed databases are hard, but this article is presumably about to go into some deep wading, and it basically does the Barbie "Math is Hard"? Come on.

"Consul is designed to make it easy to manage a single engineering team's applications. We're managing deployments for thousands of teams. It's led us to a somewhat dysfunctional relationship."

Um, I don't know consul's gotchas because I am a Cassandra guy, and a demi-Dynamo guy. Yes yes, I know any problem that is attempted to be solved by cassandra fits the joke "well, now you have two problems". But it does scale.

Why is raft needed for a simple quorum write for a key-value <service name> = <ip> ?

Thousands of teams does not sound like a hard database scaling problem. How often is this read? How often is this written?


It's thousands of teams with consistency around the entire world, updated and read in real time as the control plane of a CDN.

I didn't go into detail about Raft because it isn't the point of the article, and because I don't want people checking out because they feel like they need to grok distributed consensus to understand what we're saying. If that means we lose people with strong opinions about Raft, I'll accept the tradeoff.


Makes sense about not wanting to go boating.

I've been deep into distributed consensus the past year—working on a 1 million TPS replicated state machine called TigerBeetle [1]. I would have enjoyed even more detail, only to understand the problem better, but I loved this line: "With me so far? Neither am I."

[1] https://github.com/coilhq/tigerbeetle


> the other popular ones are Etcd, which once powered Kubernetes,

I don't really follow k8s, but is etcd no longer required for it?


Also wondered this. Pretty sure it is, unless you're using K3s et al., which swap it out for SQLite.


Correct, it's only the lightweight k8s distributions which don't need etcd


I blame Kurt for getting me to hedge that sentence, which originally read differently. :)

(But if you liked this article at all, I also blame Kurt for the feedback that got it from the tangled mess it was to the slightly-less-tangled mess it is now).


Definitely still etcd in there.


Is assigning multiple nodes the same routable address and using BGP to route to the closest a thing anywhere else? Is there anywhere I could read more about it? It seems (to my naïve mind) like something that could have some impressively annoying failure modes but I'm sure the engineers have thought it through.


You could deploy your own: https://github.com/google/seesaw | how-to: https://render.com/blog/how-to-build-an-anycast-network

Or have someone else deploy it for you: https://www.vultr.com/docs/configuring-bgp-on-vultr/ | https://netactuate.com/anycast/

All big cloud providers have anycast load balancer offerings. In my experience, Fly's load balancer is way simpler to use than those (because it is the only way to load balance, like with Cloudflare).


You can bring your own AS quite easily at a couple of places such as vultr. I heard TCP works surprisingly well over anycast as routes are more static than one would fear.

Note that in the case of fly you don't directly get traffic to an anycast IP but through a reverse proxy.

I could imagine that they replicate connection states to all pops so that traffic that arrives in the wrong pop can be rerouted.


It's really common for CDNs. Definitely for DNS and almost always for http. Anycast routing.

In the past people were concerned more about route flapping disrupting TCP connections. It seems that people aren't worried about that anymore. Google does it.


> Anycast routing

Aha, that brings up technical articles, thanks. TFA does mention it multiple times so mea culpa.


The post mentions they are moving off Nomad; interested to learn why and what the replacement is.


There are multiple reasons, some simple and some subtle. We don't totally have the bin-packing problem Nomad is designed to solve. We do have "users have fussy and meticulously specified ideas of how they want their apps to deploy", which is not really what Nomad is designed for; Nomad is designed to decouple your pool of compute from your applications, and to make it low-drama to deploy new applications to that pool of compute. But our users come with some level of drama built-in; it's an impedance mismatch.

I like Nomad a lot. I think it's a better fit for what we're doing than Consul is (they're both great, but I think Consul is the bigger sticking point). You'd get different opinions from different people on our team, but I think Nomad's going to be around for awhile. But there are things we very much want to ship that we can't ship on Nomad.

I would 100% use Nomad again if I had a somewhat complicated app I needed to deploy and couldn't use Fly.io for. (I'd use Consul too, for that matter.)

Per the sibling comment: we are indeed writing a new orchestration system in Go, called `flyd`.


I'm curious what kind of things do you want to ship that you can't ship on Nomad?


The giant feature is "start a VM in less than 1 second". Followed by things like suspending/resuming VMs.

We also need precise placement. We have people who want to run 8 VMs in London, 3 VMs in Los Angeles, and one VM in Sydney. They also want to do rolling deploys that revert on failure. Nomad's scheduling is not that precise.

Databases too. Both Nomad and k8s have ways to do persistent workloads, but they rely heavily on the underlying infrastructure to "move" volumes between throwaway containers. We don't want that, we actually want something closer to old school VM provisioning than "orchestration".


> Both Nomad and k8s have ways to do persistent workloads, but they rely heavily on the underlying infrastructure to "move" volumes between throwaway containers

I did this with Nomad by writing a small API to "place" host volumes which Nomad can use for job placement constraints automatically. It works shockingly well for my use-case which I think is somewhat similar to Fly's? A smallish number of containers that require host persistent storage, and don't expect anything to move -- just disappear forever (persistent til' it ain't).


This is what we do now, basically. We have a Nomad Device Plugin and create local LVMs. We also have plumbing to move these between hosts. It works pretty well, it's just very limiting.

What we need is to be able to target machines to specific volumes. And for things like database clusters, we need to control restart order during upgrades.


> The giant feature is "start a VM in less than 1 second". Followed by things like suspending/resuming VMs.

Is this for scaling existing apps up (eg flyctl regions add ...) or for new deployments (flyctl deploy)?

I suspect the former (scaling existing) as artifact distribution and deployment parameters like health checks seem like they prevent subsecond deployments (even ignoring the time for initial artifact uploading/building). Although my view is probably quite biased by Nomad having a much better chance at solving the former than the latter!

(Also hi! I appreciate our chats in the past and feel free to reach out if you're willing to brain dump Nomad gripes on me. :) )


It's very different than Nomad! The basic model is direct "machine" management with very little orchestration. Machines are kind of like allocations, but more permanent.

Some machine operations are slow, some are fast. Create is slow, update is slow. Start and stop are fast. Future "suspend" and "resume" are very fast.

For the regions case, people use these by having a couple of machines per region they care about. Then they start and stop as needed. A stopped machine can be migrated between nodes, but we ensure it's ready to start quickly if we do that.

Some interesting things happen with this model. Create is slow, but we are able to make it fast _sometimes_ because we can target creates at hardware that already has the rootfs cached.

Updates don't work like Nomad's; we update machines in place for a rolling deploy. This lets us do things like pull an image in parallel, then stop the machine, update, and restart very quickly.

We're also doing things differently with health checks. We may not wait for active health checks before sending a machine an HTTP request. Many workloads don't even care about health checks, they just want a VM running and streaming logs back as fast as possible.

There are features I could ask for in Nomad to do what we'd need, but they'd make it not Nomad. :D


Hiring? Sounds like a match for problems I find interesting.


Soon! We're polishing up a platform engineer role now. It'll appear on https://fly.io/jobs/


If you find these kind of DevOps problems challenging, I'm hiring. :D


> we are indeed writing a new orchestration system in Go, called `flyd`.

I know it's just a wild coincidence, but I couldn't help but be reminded of https://github.com/flynn/flynn


I'm friends with at least one Flynn person and I'm always wondering what they'd think of our weird architecture. There's a lot of interesting code in that repo!


I hope there will be a blog post about flyd! I love reading about what fly.io is doing.


> Rewriting a Rust program in Go is sacrilege, we know, but Go had the Consul libraries we needed, and, long term, that Go server is going to end up baked into our orchestration code, which is already in Go.

This line from the post indicates to me that they're writing it themselves (since they wouldn't be worrying about being baked into orchestration code if it's Consul and Nomad). Still, I'd also be interested in what they're doing.


Great post! I always learn something new from fly.io blog posts. I hope y'all succeeded in clearing out the hobgoblins.

As an Elixir developer, fly.io has been a tempting option because it has a lot of buy-in in the community, but counterintuitively posts like this make me question if fly.io is really for me and my relatively small and local apps. Apps with a global presence requiring low latency everywhere, as this post shows, invites a lot of technical complexity. Is fly.io mostly for the "big dogs" then? Is it still a good, affordable solution with nice developer UX if you only ever intend to deploy to a single region?


Believe it or not, pretty much every cloud that runs full stack apps has a similar proxy / service discovery layer. Even if they only let you run in one region.

If you run your app on your own servers with your own load balancers and terminate your own TLS, you avoid a lot of this infrastructure. If you use anyone with a "cloud" load balancer, you get almost exactly what we've described here.

Complexity is a legitimate reason to manage your own server. A single server with a single database and nothing between you and the users avoids a lot of complexity.


I don’t have a take on if it’s a good fit for single region hosting (though the tooling makes that tremendously simple).

I will say that I think the days of single region deployments are coming to an end, even for “the little guys”. Regulation all over the world is demanding your application be region aware and the nature of the internet means your users will come from those jurisdictions.

If anything we need simple web stacks that acknowledge this and hosts (like fly) and data tiers that enable it trivially.

Sorry for the tangent-y rant.


I don't fully understand the argument that breaking up the big JSONs into the KV tree wouldn't work. IIRC, you can make wildcard-style queries, or at the very least efficiently dump the whole thing if you need to...
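
For what it's worth, a recursive prefix read with the Consul Go API looks roughly like this (the "services/" prefix is made up); it's the programmatic equivalent of "consul kv get -recurse services/":

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // List every key under the prefix in one call, instead of storing
        // one big JSON blob per service and fetching it whole.
        pairs, _, err := client.KV().List("services/", nil)
        if err != nil {
            log.Fatal(err)
        }
        for _, p := range pairs {
            fmt.Printf("%s = %s\n", p.Key, p.Value)
        }
    }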

Overall though, the application sounds unique enough that a custom directory/catalog/distributed load balancer may make sense.


The post mentions “hints” delivered over NATS several times. I’d love to read more about what the shape of that is.


When a VM comes up, we write a service to consul and simultaneously broadcast an identical NATS message. Nodes that are in a happy place get the NATS message, update their state, and then reconcile against Consul when Consul catches up.

NATS has the benefit of being p2p, so when a VM comes up in Sydney, the other Sydney nodes see:

- The nats message in about 5ms

- Time passes (call it 1-10s when things are working well)

- Consul data includes new entry

We also use "hints" to broadcast things that we can afford to never see. Load on VMs is an example. When a VM hits its max concurrency, we broadcast a NATS message so other nodes can take that into account for load balancing.
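
A loose sketch of that dual write, using the Consul and NATS Go clients (the subject name, IDs, and Hint struct are invented; the real plumbing is surely more involved):

    package main

    import (
        "encoding/json"
        "log"

        consul "github.com/hashicorp/consul/api"
        "github.com/nats-io/nats.go"
    )

    // Hint is an invented payload describing a VM/service that just came up.
    type Hint struct {
        Node    string `json:"node"`
        Service string `json:"service"`
        Addr    string `json:"addr"`
        Port    int    `json:"port"`
    }

    func main() {
        cc, err := consul.NewClient(consul.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        hint := Hint{Node: "syd-worker-1", Service: "app-1234", Addr: "10.0.0.9", Port: 8080}

        // Slow path: the authoritative write, which has to round-trip to the
        // Consul servers and then fan back out to every agent.
        err = cc.Agent().ServiceRegister(&consul.AgentServiceRegistration{
            ID:      hint.Service + "-" + hint.Node,
            Name:    hint.Service,
            Address: hint.Addr,
            Port:    hint.Port,
        })
        if err != nil {
            log.Fatal(err)
        }

        // Fast path: the hint. Nearby nodes see this in milliseconds and
        // update local state right away; it's fine if some of them miss it,
        // because Consul eventually catches everyone up.
        payload, _ := json.Marshal(hint)
        if err := nc.Publish("hints.service.up", payload); err != nil {
            log.Printf("hint broadcast failed; Consul will catch everyone up: %v", err)
        }
    }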


Do you view Consul as authoritative once it arrives if it clashes with the hint?


It's complicated; there's some reconciliation logic that considers the type of update and the timing of the hint vs. the Consul status. When a VM startup hint lands, beating Consul, Consul will continuously assert the non-existence of that VM until the whole Raft dance completes and the API polling picks it up --- so, effectively, the whole point of the hints is not to let Consul be authoritative for some data, for short periods of time.

It's super gross!

The basic use case here is: you start or stop a VM in Tokyo; our Consul Server cluster is halfway around the world, and every other host in Tokyo needs to wait for Consul to update to see that change, even though it happened right next door. With the hints, the change percolates instantly, and eventually Consul catches up and we reconcile.


Mostly yes. There are some races, particularly when services are deleted. A delete hint is actually authoritative.
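
A hypothetical sketch of that reconciliation, pulling together what's described in this thread (the types and the grace window are invented): a fresh "up" hint wins for a short window while Consul lags, a delete hint wins outright, and otherwise Consul is authoritative.

    package hints

    import "time"

    // Instance is an invented record for a single VM/service instance.
    type Instance struct {
        ID      string
        Addr    string
        Deleted bool
    }

    // HintRecord is an Instance as last seen via a NATS hint.
    type HintRecord struct {
        Instance Instance
        SeenAt   time.Time
    }

    // hintGrace is invented; tune it to how far Consul usually lags.
    const hintGrace = 30 * time.Second

    // reconcile merges Consul's current view with recently seen hints.
    func reconcile(consulView map[string]Instance, hints map[string]HintRecord, now time.Time) map[string]Instance {
        merged := make(map[string]Instance, len(consulView))
        for id, inst := range consulView {
            merged[id] = inst
        }
        for id, h := range hints {
            switch {
            case h.Instance.Deleted:
                // Delete hints are authoritative: drop the instance even if
                // Consul still lists it.
                delete(merged, id)
            case now.Sub(h.SeenAt) < hintGrace:
                // Within the grace window, trust the hint even though Consul
                // hasn't finished the Raft round trip yet.
                merged[id] = h.Instance
            }
            // Older non-delete hints fall through: Consul wins once it has
            // caught up.
        }
        return merged
    }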


Awesome. Thanks!


This seems like a good use-case for the Google Spanner database.



