Fly deployments have been "down" (partially / fully) for a couple days per their status page.
After all of the recent talk about moving from Heroku to Fly, I was surprised to have all of these operational issues when doing my initial sniff test.
Can anyone testify to the production worthiness of Fly, or should I look somewhere else?
Edit: Before anyone says "editorialized title", when I posted this, the title of the linked page literally said "Deployments are broken".
We've had an issue with our Consul/Nomad since last night. This was preventing new deploys, and also preventing rescheduling of people's VMs when they crashed. Not good!
This did not affect running apps (unless they crashed and needed rescheduling).
This kind of event is super rare. I think this is the second outage of this scale we've had in the last three years.
The Consul/Nomad deploy infrastructure is the most brittle part of our stack. We are working to replace this. New Postgres DBs don't use it at all, but it'll be a few months before all apps are off.
While we're still relying on Consul/Nomad, there's a chance this will happen again. But the way these tend to work is things break when we cross some capacity threshold. We get that fixed and it buys us time to discover the next capacity threshold.
Also, we _aggressively_ update our status page. It's not really an indication of our reliability relative to other providers. You need to read each individual incident to get an idea of what the effect was. Earlier this week we had an issue, lasting about 45 minutes, where new apps couldn't get new IPv4 addresses. That's not awesome, but it's not the same scale of problem as we dealt with last night either.
Other status page entries are "a host in a particular region failed". This is entirely normal, and something we're going to deal with forever.