Migrating from AWS to Fly.io (terrateam.io)
252 points by sausagefeet on Jan 3, 2023 | 179 comments



Fly is great. I know the Fly team has an aversion to it but I really just wish they would hire some database folks and take on managed Postgres. I'm already running 8+ apps on there and it would be the peace of mind needed to move the rest across… one day.


The lack of “managed” Postgres is the only reason we haven’t moved 100% to fly.

The idea of having to do my own upgrades by updating the docker container scares me.

I have literally never run a stateful service off docker and don’t see the point, thanks to AWS.

RDS has been extremely stable and boring. Maybe it's put together with similar duct tape under the hood, but it's been incredibly solid for nearly a decade across multiple database stacks.


What stops you from using a managed database from a provider who specializes in it?

If Fly builds it, it will create another Goliath who tries to eat everything.


> What stops you from using a managed database from a provider who specializes in it?

A lack of integrated security and authorization? Should be solvable though.

The expense of external traffic? The latency to a DB in a different datacenter?


That latency was the deal breaker for me. Kind of defeats having your app running on metal near your customers if they're in Dallas and the DB's in Ohio


Network posture? I can’t imagine trying to put the public internet in between my app servers and db.


Doesn't fly.io have tailscale integration?


You still need to go over a public network (as opposed to a network inside a data center).


what makes a network private?


An RFC 1918 address range and an implementation that does not transmit data over infrastructure not owned by the network operator.

I think I know what you're getting at, and if I'm right, this isn't really a subjective or squishy subject.
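The first criterion is mechanical: RFC 1918 reserves 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 for private use. A quick illustration of that check with Python's stdlib `ipaddress` module:

```python
import ipaddress

def is_rfc1918(addr: str) -> bool:
    """True if addr falls inside one of the RFC 1918 private IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(
        ip in net
        for net in (
            ipaddress.ip_network("10.0.0.0/8"),
            ipaddress.ip_network("172.16.0.0/12"),
            ipaddress.ip_network("192.168.0.0/16"),
        )
    )

print(is_rfc1918("10.1.2.3"))    # True
print(is_rfc1918("172.31.0.1"))  # True (172.16.0.0/12 runs through 172.31.255.255)
print(is_rfc1918("8.8.8.8"))     # False
```

As the comment notes, the address range is only half the definition; the "owned infrastructure" half can't be checked in code.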


The point the OP wanted to make was:

Servers sitting less than 1km from each other tend to have lower latency between them, compared to servers that could be anywhere on the public internet.


Ah, I was interpreting it more at the logical level, about packets venturing into unknown territory (traversing routers I do not control), rather than just latency, since others have mentioned that latencies are very low if you locate in the same region.

But if you have a zero-trust architecture and everything communicates over WireGuard, then technically public or private won't matter, right?

Not saying public/private IPs don't matter – since almost all the servers would still only have private IPs.

In a sense this is more like VPC peering.


With IPv6, you can have all public IPs (the same as early days of IPv4). However, if you have apps that are tightly coupled and often with strict latency requirements such as Web <-> DB, then it's better to communicate only over private links. You can control latency and you have better reaction time when something breaks down. That is in addition to security concerns.

I remember a failure of one transatlantic line that was used for one direction of packets we were sending between EU and US. I spent a night on a phone trying to convince AWS and our other datacenter operator to work around the problem by changing their routing and to push the backbone operator to fix the problem - we didn't have any relationship with the backbone of course.

You don't want such disruption to happen in a core part of your app. These disruptions can also be intermittent and "random" and you have no way to fix.


I wonder if we're moving to an unbundling phase of the cloud.

The only reason I think we might not be is the selling point is less vendor lock-in, not something managed significantly better. For the most part, the major cloud providers have mature, feature rich product offerings for common use cases like DB, distributed queues, running services, load balancers, etc.


I hope we're unbundling! RDS is fine, but it doesn't really offer me (as an app dev ... sometimes) much. PlanetScale is a better MySQL: https://planetscale.com/blog/introducing-planetscale-boost


Are you guys looking at a partner program where a lot of the managed postgres companies could also provide managed postgres but on top of your infra? (like they do with GCP/AWS/etc)


This is an advantage of the big public clouds. Any SaaS you could want is deployed into the same region as your application, so it's essentially co-located and on the same playing field as one of the native public cloud services.

With Fly.io (at least today), this isn't as straightforward given your favorite database is likely not deployed there.


Even at minimal scale, one is going to have quite a few app instances all over the world sooner or later. In such a setting, a notion of regional proximity of any kind quickly becomes irrelevant.

Higher database round-trip times are fixed by using database stored procedures instead of making a multitude of tiny SQL queries.
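The arithmetic behind that claim is simple: each client-issued query pays a full network round trip, while one stored-procedure call pays a single round trip regardless of how much work happens server-side. A back-of-envelope sketch with illustrative (not measured) numbers:

```python
def total_latency_ms(round_trips: int, rtt_ms: float, server_work_ms: float) -> float:
    """Wall-clock time of a request dominated by network round trips plus server work."""
    return round_trips * rtt_ms + server_work_ms

RTT = 30.0   # assumed cross-region round trip, ms
WORK = 5.0   # assumed total server-side execution time, ms

chatty = total_latency_ms(20, RTT, WORK)   # 20 tiny queries issued one by one
batched = total_latency_ms(1, RTT, WORK)   # one stored-procedure call doing the same work

print(chatty)   # 605.0
print(batched)  # 35.0
```

With a 30ms RTT the chatty pattern is dominated entirely by the network, which is why co-location (the sibling comment's point) and batching both attack the same problem.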


Not everything can be represented by a single database query. Often there is mixing involved across multiple sources, which breaks the one-query-for-all-results pattern and thus encourages co-locating the database near the tier making the DB requests.


Not really. I see fly competing with Workers for smaller services, and with KV and D1, managed edge DBs are so easy it’s hard to live without them. Very different use case from a traditional managed DB - because it’s either all scaling and running in fly or it isn’t.


I’m planning on moving to fly.io, and I’m exploring cloud-based databases. Specifically Planetscale and CockroachDB.



I'm not completely averse (we've done this before!), it's just not how I think the world should work. If we hold out, and the future Fly cloud has a better-than-RDS Postgres option, it will be worth it. If we build our own managed Postgres, no one better is going to bring their stuff to our platform. Same with MySQL.


> I'm not completely averse (we've done this before!)

For those out-of-loop: https://news.ycombinator.com/item?id=7382151


I'd trust a first-party managed postgres built by fly more than a third-party one.


I like their strategy of outsourcing what they're not best at to the best. They partnered with Upstash, and now they've got a new feature almost immediately. Sure, there's a lot of vetting, but wrapping products is common.

I’d be happy if they can wrap Neon’s serverless postgres service into their networks somehow.


Neon CEO here.

We are working on becoming a first-party service on several major platforms now. We will be happy to provide this for Fly. Kurt and I have been chatting.

Fly wants cross region replication which is coming soon. Once it's there we can integrate.

That said, Neon is still on AWS and can be close to Fly, but not run on Fly servers. It's relatively straightforward for us to run on Fly, but the S3 part will still be Amazon.


This sounds like a match made in heaven.


Fantastic, nikita! Hope all goes well.


Would love this also.

The best option I could find was using Digital Ocean Managed Databases. The cheapest costs $15/month, but you can host multiple databases with good insights and backups. You can choose where you host it and place it close to the region where your Fly.io apps are, with low latency.

The only caveat is there isn't an easy way to automatically update the database's IP whitelist to support the Fly builder and deployed app(s).


I recently switched to fly.io + crunchybridge. Crunchy is super easy to set up and their cheap plan ($10/month) is quite performant too.


The best value option I've seen so far is Azure SQL Server databases. Not exactly Postgres, but it's only $5/month and works wonders.


Nowhere do I see mentioned how much was actually saved in dollar figures. What are we talking here: thousands, hundreds? Similarly, I'd be interested in knowing the engineering effort involved in the migration.

As a software consultant focusing on designing, building, and deploying software on AWS, more often than not, the infrastructure cost issues I've seen have less to do with the underlying infrastructure provider — AWS in this case — and more to do with the application itself that's being deployed. Most recently, we were able to shave off 50% of a client's bill that was using Lambda for compute. The issue? Several, including sleep timers (on a service where you pay by the millisecond!) as well as pathological code (i.e. consume message from queue, enqueue same message, rinse and repeat).

Yes, you can save money by transitioning to another provider. But I'd start with reviewing the underlying code and architecture first.


The biggest $ cost I'd have on my radar WRT AWS vs. Fly is complexity: the employee time and know-how needed to do any given thing. Fly is way easier to navigate and use than AWS and its labyrinth of Cloud Scale™ horrors. The trade-off is that you can do more with AWS.


>Fly is way easier to navigate and use than AWS and its labyrinth of Cloud Scale™ horrors

Give fly.io a couple years and it will end up forced to take on all the complexity and edge cases AWS and Azure have to deal with today.

A shiny new codebase is always nicer than an old one, because the old one has had to accommodate all those pesky customer needs.


Did that ever happen to Heroku? Ignoring 10% of use cases for their target customers and nailing the other 90% doesn't sound crazy to me.


You think it's inevitable? I think it's unlikely; they're probably targeting a different kind of customer. Many of those "edge case" horrors seem unnecessary, like permission hell and pricing-model hell (and too many instance sizes).


Yeah, this. AWS enables you to do anything you can think of, at any scale and complexity, given enough time spent reading their documentation and crafting terraform configurations.

Fly makes it really easy to do the 80% of things that most small/medium operations need to just get done now.


I can't wait for Fly to disrupt the market of Cloud Native Eldritch Horrors™

Their SLA's a bit too iffy for my use case atm but I wish them well in this noble endeavor.


I've been running a hobby project on their free tier for a while and it's mostly cool.

Wrote up some stuff back then: https://f5n.org/blog/2022/trying-out-some-hosting-options/ but unfortunately none of my pain points/caveats seem to have been fixed

  * deploying (re)builds your container but doesn't even tag it locally, so if you just built one, it will build the exact same one again, and you're left guessing which hash is the container the tool just built

  * if you start with your own Dockerfile, the mandatory options in fly.toml aren't described overly well, so expect a bit of fiddling

  * recently I redeployed and, without any real change (just a new code build), it wouldn't deploy; I had to remove some of the settings in fly.toml I had guessed at to make it work in the summer - no changelog anywhere

  * if your app comes close to some assumed limit on services.concurrency (which is not documented well), again you need to guess and redeploy until it works

Maybe I've run into an edge case with a JVM app where 256MB of RAM is tight, but my problem is more that all the feedback I get from the tooling is a little meh and doesn't make me confident enough to ever try production workloads here. Not complaining about the free service though, and the dashboard/metrics stuff seems nice.


I never got as far as running my Spring Boot application on fly.io. Killed after a few seconds. I’d love to know why. https://community.fly.io/t/java-app-is-killed-on-startup/837...


Mine's a Clojure app and I'm running it via `java -Xmx180m -Xms180m` and my fly.toml has a kill_timeout of 15 and in services.tcp_checks I now have interval 25000 and timeout 9000 - but as I wrote, it's a very basic app, anything JVM on 256MB has always (even years before fly.io) been hit or miss, sadly. Good luck!
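For anyone wiring up something similar, a sketch of how those settings described above might look in a fly.toml (values from the comment; the `internal_port` and `grace_period` are assumptions, not from the original):

```toml
# Sketch only: v1-style app config, values per the comment above
kill_timeout = 15

[[services]]
  internal_port = 8080        # assumption: whatever port the JVM app listens on
  protocol = "tcp"

  [[services.tcp_checks]]
    interval = 25000          # ms between health checks
    timeout = 9000            # ms before a check counts as failed
    grace_period = "30s"      # assumption: give a slow-booting JVM time to start
```

The long interval/timeout and kill_timeout are there to stop the platform from killing a JVM that is still warming up in a tight 256MB allocation.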


Thanks! I may give it another go.


The article mentions this, but not strongly enough in my opinion. With Fly you have no managed PostgreSQL. In my opinion, it's not really comparable on cost to AWS when Postgres is in the stack.

You do read about interesting hacks where someone will set up RDS in a region that may be single-digit milliseconds away from a Fly region. Then, presumably, you could put PgBouncer on a sort of bastion host that connects to the Fly WireGuard VPN. But obviously, there's no guarantee that the latency will always be that good.


RDS works really well with Fly apps. We often see lower latency between Fly.io and us-east-1 in Ashburn than an app does spanning two availability zones in us-east-1.

But, realistically, we'd like a managed Postgres provider on Fly.io hardware. It's a much better developer experience, we need DBs in every region, and our private networking is pretty dang powerful. I think we're close, but we may need to get a little bigger before we seem relevant to them. We're weirdly closer to managed MySQL than Postgres.


Thanks for this, this is the info I was looking for. I'm thinking of using fly.io for some bigger projects and was considering the possibility of having data on AWS, so this is great to hear.


> But obviously, there’s no guarantee that the latency will always be that good

That guarantee already doesn't exist with RDS. The RDS master is in one AZ, your server may be in another (and if it isn't, well, RDS will fail over to another AZ eventually). The network latency between AWS AZs is usually good, but it can be arbitrarily high, up to full outages between AZs.

This is just a measure of degrees. Your application and/or business already has to handle some degree of network issues (aws network outages), so it's a tradeoff of whether the extra hops increase the chance enough to matter.


Isn't it usually a bad idea to have the database and application server in different data centers? It is / was my understanding that both the database and application server should ideally be on the same rack.


It's really just a matter of latency and bandwidth costs. If you can tolerate both, then keeping them in different data centers is fine. In fact, depending on your architecture, you may have to do this at some point anyway, even within the same cloud provider.


Does Fly let you run unmanaged Postgres easily enough? Or even semi-managed, as easy-to-provision nodes without redundancy, where you set up pgbouncer, replication, etc. yourself? For simpler cases that could very well suffice.


Fly Postgres is just a Fly.io app. You can see the source code for it right here:

https://github.com/fly-apps/postgres-ha

It has some direct `flyctl` integration (which is also open source), but it's not doing anything you can't do yourself if you want.


https://fly.io/docs/postgres/

It isn't "Managed Postgres," but the differences are minimal. RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."


> RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."

I think this is a terrible oversimplification, and something tells me you haven't had to deal with a complex database setup from an operations perspective. RDS removes a huge amount of operational overhead (HA, backups, upgrades, and clustering being the first that come to mind). Between RDS and running a database on virtual machines managed with, say, Ansible to provide those four features, I would choose RDS any day of the week.


> you haven't had to deal with a complex database setup from an operations perspective

I've often found it to be the opposite actually.

My experience is with RDS MySQL, and on that, RDS heavily restricts what you can do. Want to do partial replication? Nope. Want to install a database plugin? No access provided to do so.

Used to have a MySQL instance on EC2, but the rest of the team joined the cargo-cult of 'everyone else uses RDS, so it must be good'. I used to be able to grep replication logs to find problem queries, but RDS doesn't give you access to those. I used to use various I/O and CPU monitoring tools to help pinpoint bottlenecks in queries/performance, but you only get the few metrics RDS gives you (e.g. RDS only gives you aggregate CPU usage, not per-core usage).

Even stuff like killing queries gets annoying - standard MySQL GUIs typically issue a `KILL` statement, but you aren't given permission to execute that. RDS provides a workaround via a stored procedure, but that means you have to break into a console and remember the name of the SP.

Which leads to my next point - I think "managed" is a big misnomer. RDS is nothing like a truly managed DB with a DBA. AWS isn't assigning someone to optimise your tables, or won't help you look at your queries to see what can be done better. If something goes wrong, it's on you to fix it. IMO, RDS is more like a pre-configured database. It saves you from having to initially configure the database, and saves you from having to set up automated backups, HA etc.

My opinion is that, if all you need is a cookie-cutter solution, RDS is okay. If you need a complex setup, stay away.


Spot on. RDS is a solution for people who look in the mirror and say "I'd rather be working on other things than running a database."

(Disclaimer: I'm a TAM for AWS.)


In case it needs saying: Fly.io agrees with this! We didn't build our Postgres feature as a statement about the utility of managed Postgres; it's a statement about our size relative to AWS. :P


...or their primary job has been dealing with a complex database setup, and they haven't had to juggle that along with intensive application design.

Count me as someone who confidently doesn't know how to run a database and doesn't really care to. At least not at a level where someone would hire me to do it in production.


I look in the mirror and confidently say I don't know how to run a database. Was the grandparent comment meant to be an insult?


It was clearly intended to dismiss managed databases altogether, so yes.


I didn't interpret it as such, but as an oversimplification and a sign of a lack of operations experience, as I stated at the beginning of my comment.


I don't think it's an insult. I don't know how to run a database; I want a managed service.


> you don't know how to run a database.

Hah, I wish you didn’t need to know how to run a database to use RDS. So much is dependent on settings/parameter groups.


Do you have any [pointers to] recommended best practices?


> RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."

In the same way that a RDBMS is ultimately a solution for people who look in the mirror and confidently say “you don’t know how to directly write to disc while guaranteeing the validity of relational data in spite of concurrent writes, power failures, etc.”


An absolutely spot-on comparison. For some people, RDS is the cornerstone of their business. For others, even a failed database can be rebuilt from logs within hours, and it isn't business-impacting.

Use the tools that make sense, but don't be afraid to pick the right tool.


What if I know how, and simultaneously don’t want to?


> RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."

Which is hopefully virtually everyone whose full-time job isn't DBA.


I used to be very DBA-focused ages ago. AWS RDS for Postgres I just love; their RDS products must print money. Yes, I could deploy it myself. Yes, I could use Docker in various ways (I use Docker for most app deploys). But if you've been down the rabbit hole of scaling a database, or backing up, updating, securing, etc., it's a no-brainer, and AWS lets you start small.


Author here. Some folks wanted the details on pricing.

Pre migration, the Terrateam AWS monthly bill was about $200/mo. With Fly.io, we're paying around $90/mo.


$61/mo for 1 vCPU and 8GB RAM? Try Hetzner next time, on bare metal, for cheaper.


You can get 1vCPU + 2GB RAM for 3.92 Euro/month at Hetzner. Those prices are hard to beat.


You can get 4 virtual ARM CPUs + 24GB RAM + a 200GB disk for $0/month at Oracle Cloud. Nothing can beat that.


Any caveats? This sounds really great.


Oracle.

There is a reason why they need to have this offer to attract a few flies. Personally I would stay as far away as I can from Oracle's products.


say more for us uninitiated?


There are several reasons why Oracle has a poor reputation and should be avoided in business:

High licensing fees: Oracle is known for its high licensing fees, which can be very expensive for businesses. This can lead to a financial burden on companies, especially smaller businesses.

Poor customer support: Oracle has a reputation for poor customer support, with many users complaining about long wait times, unhelpful responses, and a lack of follow-through.

Complex software: Oracle's software can be difficult to use and understand, which can lead to delays and frustration for businesses.

Compatibility issues: Oracle's software is not always compatible with other systems and can cause problems with integration.

Poor security: Oracle has had a number of security breaches in the past, which can lead to concerns about data privacy and security.

Overall, Oracle's high costs, poor customer support, complex software, compatibility issues, and security concerns make it a risky choice for businesses.

Moreover, Oracle has a history of suing companies that it believes are using its software without proper licensing or permission. This includes sending out lawyer letters to companies that it believes are infringing on its intellectual property rights. Oracle has been criticized for its aggressive tactics, which some believe are designed to intimidate and bully companies into compliance.

I wrote this answer with the help of ChatGPT. I think the AI did learn Oracle’s bad side pretty well.

I will just add a link to an archive of the famous blog post that was thankfully deleted but shows that the reputation is not completely inaccurate: https://web.archive.org/web/20150811052336/https://blogs.ora...


1. ARM, not x86, so for some it may be a big issue.

2. Free-tier Oracle sometimes feels like you have lower priority on the given resources, and network speed is less than on Hetzner, but it has worked fine for a year now. I wouldn't trust commercial things running on it without a plan B, though.


It's great, I've used it for the last 2 years. Btw, you can have 4 machines for free (2x 1-core x64 with 1GB RAM each + 2x 2-core arm64 with 12GB RAM each); the minimum hard disk is 50GB, so 4 machines from the free 200GB. Free network traffic up to 10TB/month, 2 free 20GB Oracle databases, and 20GB of free object storage.

Oh, and FreeBSD runs on it (I think it's under partner-images).


No. There is even a lot more included: https://www.oracle.com/cloud/free/#always-free. 90% of the internet can run on the free tier.

I also really like the dashboard much better than AWS or GCP.


> 90% of the internet can run on the free tier.

Only, of course, it really, really can't, and their current offer is very much contingent on that not happening.


Last time I checked, Ampere cores were not available in the region I wanted. Before jumping in, first check whether the Oracle image for ARM is available in the region. This doesn't guarantee you can get cores, but without the image there are no cores to boot it on.


None. It's part of the free-forever tier.


Yeah, but the latency on a shared vCPU has been very noticeable in my experience.

For anything serious you would want a dedicated vCPU, which is several times more expensive (though still quite affordable).


Netcup has cheap plans with that: €10/month.


How much total time / effort did the migration take and when do you break even on the $110/month cost saving?


It took about a weekend all said and done. We're much happier with the Fly developer experience.

$200 -> $90 a month is peanuts. But we're 100% bootstrapped and frugal.


Can you elaborate on the problems you faced with stolon?

You mentioned there was a case where stolon didn't fail over. Have you created a GitHub issue for this?

(Have been a user and contributor of stolon; hence curious to know)


Could you break down the costs of AWS?

From your post you say you needed a simple service plus a Postgres instance. I think you can get that for around $50/month on AWS using a small EC2 instance and Aurora Serverless v2 (min $40/mo).


I don't have access to the bills anymore. Two AZs, RDS HA, bandwidth, and EC2 all added up. It's possible I could have cut down the bill.

The initial motivation to migrate was cost, but at the end of the day we're just happy with the Fly.io developer experience. It's a better fit for us.


I missed when this was added:

> if you configure your application to expose a Prometheus endpoint, those metrics will automatically show up on your Grafana dashboard

But that's such an amazing and simple idea to integrate observability. Love the approach.


Isn't this just how Prometheus works alongside Grafana? I've never used fly.io but I've used that exact setup with these two services.


I believe GP is commenting on Fly's integration. Fly comes with built-in Prometheus scraping and prepopulated Grafana dashboards for VM resources and some connection/request metrics. I believe the value is that if you simply expose a Prometheus endpoint, they'll scrape it and you can use their Grafana (or hook your own into their Prometheus).

Nothing special vs. vanilla Prom/Grafana, but a seamless "no-ops" integration vs. DIY.
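For anyone who hasn't wired this up before: "exposing a Prometheus endpoint" just means serving the Prometheus text exposition format over HTTP. A minimal stdlib-only sketch (a real app would normally use a client library such as prometheus_client; the metric name here is made up):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # toy counter; a real app would track this properly

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text format: HELP/TYPE comment lines followed by samples
        body = (
            "# HELP myapp_requests_total Total requests served.\n"
            "# TYPE myapp_requests_total counter\n"
            f"myapp_requests_total {REQUEST_COUNT}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = ephemeral
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(f"metrics on port {server.server_port}")
```

On Fly the scrape target would be whatever port the platform is configured to probe; the point of the integration is that nothing else is needed on your side.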


Correct. It's common for companies to have their own custom systems or specific agents, then build a very custom and limited dashboard on top.

Fly.io went with "you know grafana and prometheus? yeah, we'll do just that". And I think that's perfect.


Openshift does something similar (though you need to annotate your deployments). But yes, the simplicity is refreshing.


I'm not sure this is well-known enough yet - fly now supports scale-to-zero services using fly machines - https://fly.io/docs/machines/


I've been trying fly machines with terraform and running into one bug after another: https://github.com/fly-apps/terraform-provider-fly/issues/12... https://github.com/fly-apps/terraform-provider-fly/issues/12... https://github.com/fly-apps/terraform-provider-fly/issues/12...

I was really excited since fly seems dead simple, but I haven't gotten it running yet...


The Terraform provider was a one off project from the community. We haven't really focused on it in earnest yet, unfortunately. We ended up sponsoring it, but until we get Fly apps off Nomad I don't think we'll have the best terraform story.


Ah, well that explains it I guess.

Though it's a "partner" provider, it's in your official GitHub account, and there are no notices saying it's not officially supported... it would probably help people if that were a bit clearer.


Yep, this is a good note. Things got fuzzy, because we were so impressed by the person who took it upon themselves to randomly build a Terraform provider for us that we hired them.


Sorry, not to use this as a support forum, but what exactly does it need? Just some provider-side bug fixes? Or are there Fly-side changes that need to be made before it will work? If it's something small I wouldn't mind poking at it.


Just provider bug fixes / development. The APIs it uses are pretty stable. The particular errors in those issues look like either badly formed API calls or something weird with auth.

It also probably needs an update to use api.machines.dev. It is currently trying to connect over wireguard to hit our private API, because we didn't have a public API endpoint when it was built. This is probably brittle. :)


What's api.machines.dev? The documentation here says you need a VPN connection still https://fly.io/docs/machines/working-with-machines/

Can I just sub `https://api.machines.dev` in for the api hostname everywhere?


Yeah. api.machines.dev is a public endpoint for our machines api.

EDIT: See this comment for an example on how to make the provider use the public endpoint https://github.com/fly-apps/terraform-provider-fly/issues/42...

EDIT EDIT: I pasted the wrong link, sorry about that.


Thanks! Is there any documentation on the stable API? The link I posted has some normal REST stuff for machines, but nothing on other resources. I can't find any central API reference/etc in your docs.

Looking at the code, it uses GraphQL? I can't find any documentation on your GraphQL API either though, aside from a web editor that doesn't appear to be linked anywhere official.


As fly apps v2 becomes more stable the docs will continue to improve. Watch this space :). As for the graphql, it's a bit of a thorny situation. The graphql exists primarily to service our needs internally and as such has no guaranteed contract. Unfortunately, however, for some things that's the only API that exists. The autogenerated docs in the playground are pretty good actually though. That said, we are working on improving our rest api so more things will be available there.


Does scale to zero work with ordinary fly apps yet?


Probably. It depends; for a simple app it should be straightforward.

I set up a machine (a few weeks ago this involved a few direct API calls using curl; maybe `flyctl` can do it all now).

Then I deployed on that machine a version of my app that gracefully shut down if no requests were received for 60s.

Works great. Shuts down after a minute, boots back up in under a second when a request comes in.
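The idle-shutdown part of that is easy to sketch: remember when the last request arrived and exit once it's older than the threshold. A minimal stdlib illustration (nothing Fly-specific; the 60s limit is the one described above):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

IDLE_LIMIT = 60.0                 # seconds of silence before shutting down
last_request = time.monotonic()   # updated on every incoming request

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global last_request
        last_request = time.monotonic()  # any traffic resets the idle clock
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve_until_idle(server: HTTPServer, idle_limit: float = IDLE_LIMIT) -> None:
    """Serve requests, then shut down after idle_limit quiet seconds."""
    threading.Thread(target=server.serve_forever, daemon=True).start()
    while time.monotonic() - last_request < idle_limit:
        time.sleep(min(1.0, idle_limit / 10))
    server.shutdown()  # process exits; the platform can boot a fresh VM on demand
```

The platform side (booting a stopped machine when a request arrives) is what Fly machines add; the app only has to exit gracefully when idle.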

See the docs for details


I found the docs difficult and inconsistent when moving between flyctl and cURL when I tried playing with machines a few months ago.

I daresay they'll be wonderful once the team get the machine stuff working under flyctl fully.

Their generous free tier does make it easy to play and assess though.


No, ordinary Fly apps (what we call v1 apps) are managed by nomad. They don't scale to zero. You can create a machine based app and it'll scale to zero just fine, but you lose most of the magic of `fly deploy` right now.


Would have loved to read how much money they saved from this migration. The number's not there, unfortunately.


Author mentioned here that they went from $200/month on AWS to $90/month of Fly: https://news.ycombinator.com/item?id=34242923

Pretty tiny cloud costs either way, probably safe to assume they’re a small, early stage startup. But even still, it was likely a bad call to spend significant engineering effort on saving $110/month, which is a rounding error even for small startups. Hard to imagine that time couldn’t have been better spent on things that grow the business, like new features and bug fixes.


Definitely tiny but we're just overall happier with Fly. The migration in total probably cost us a weekend of work all said and done. We get a lot for free with Fly (observability, multi-region, etc.) so all in all a win.


Ah fair enough, that’s a pretty small amount of work!


I would read it as a 60% saving. That's a very good one, considering that it's a recurring cost rather than a one-time one, and it's proportional to their sales. Aka, they can now provide their clients twice as much for the same price. At least in theory.


My thought was more about timing: there's so much that's URGENT for a young startup, is cost cutting really urgent, something you have to do NOW, when the current savings are just $110/month? Part of what allows startups to survive is ruthless focus on the work that's most important today, and saving $110/month seems unlikely to be that.

However, the blog post author responded, noting that this migration took only 1 weekend, and they like other aspects of Fly than just the cost. So fair enough, 1 weekend is a small investment, less than I expected, I would have guessed this took 2+ weeks.


I've migrated several services from GCP to Fly.io, and I've been very happy with Fly.

All of my apps are low-volume hobbyist web apps that run on a single machine, so I don't do any Terraform, k8s, or Postgres, so my use case is a little simpler than OP.

My biggest complaint has been with outages,[0, 1, 2] but that's gotten better. The original architecture meant that larger servers could evict other users' running servers, which they admitted was a bad idea.[2] I'm not sure if they've fixed that since, but I haven't seen it happen in a few months.

>The container logging solution provided by Fly.io is basic. It's easy to view logs with the Fly.io CLI and via the Fly.io dashboard. However, there's only a small window of logs that are kept, forcing you to create a remote logging solution either internal to your application or via a separate Fly.io application that ships logs to an external service. This is a piece of operational overhead that I'd like to see removed.

I very much agree. This is a big feature gap I feel when using Fly as well.

>Fly.io is not a support-first company. If this bothers you, then Fly.io is probably not for you.

>When emailing support, and email is the only option, it can take from hours to days to get a response. Sometimes they don't follow up. It does not feel like their level of support is on par with other cloud providers.

This hasn't been my experience. I go through the forum rather than email, but I typically see responses within hours, even on the weekends.

By contrast, when I reported bugs to GCP through their designated support channels, there was multi-day latency, and they'd just keep dismissing my report.[3]

[0] https://community.fly.io/t/cant-deploy-my-app-in-iad/7746

[1] https://community.fly.io/t/application-vms-down-without-any-...

[2] https://community.fly.io/t/app-stuck-in-pending-state-after-...

[3] https://stackoverflow.com/q/53410165/90388


You probably had a different community experience than recent folks. We've grown like 4x in the last 3 months and haven't kept up with community as well as I'd like (because hiring). And over holidays, we've had a difficult time with email support as well. It'll get better, but it's not great right now!


It's a breath of fresh air to see honesty about the good and bad from those at the company. Thank you for the transparency, I really appreciate it.


To avoid outages, just scale up the number of app instances to like 2 or 3 and choose a diverse geography for them.

Outages is not something specific to Fly. For example, if you have an app on Azure and it has only one instance then... well, you asked for it to go down from time to time.

Maybe not a big deal for hobby projects, but for business critical stuff one should always consider deploying several instances of the same app when possible.


Yeah, but going from 1 to N instances is a big jump in complexity for not much more uptime.

I'd rather have a host that can do around three nines and just accept that as my upper bound.


The key point is not to have any persistent state in the app. The app may use a database, that's ok, but the app itself should not store any persistent state on a hard drive.

Once that simple requirement is met, the scaling becomes easy.


Right, but keeping a database externally adds a lot of complexity.

My apps are self-contained in a single Docker container. Just Go, SQLite, and Litestream:

https://mtlynch.io/litestream/


I moved a rails MVP from heroku to fly.io.

Biggest negative so far: no easy way to reset the database [1]. I'd typically like to blow it away and re-create it frequently as I tinker with it e.g. adding/removing/renaming a column or changing a data type. Those small changes now require migrations, which are a fair bit more work (but not a huge problem tbh).

Biggest positive: ability to select a zone meant response time for Australian users was very fast compared to heroku; making a big difference to UX.

[1] https://stackoverflow.com/q/74664822/5783745


This is stuff where I highly recommend writing up helper scripts. I have one to pull in stuff from a fly Postgres DB, spin up my local docker compose setup, and then pump the data into there.

I think that's the real niceness here, a CLI that is pretty straightforward (not perfect by any means! but I can still wrap my head around it)
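A rough sketch of that kind of helper, for the curious. Everything here is made up for illustration (the app name, database name, password env var, and compose service), but `fly proxy` is the real tunneling subcommand:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull data from a Fly Postgres app into a local
# docker compose Postgres. All names (my-db-app, mydb, FLY_PG_PASSWORD,
# the compose service "db") are assumptions to adapt to your setup.
set -euo pipefail

pull_db() {
  # Tunnel local port 15432 to the Fly Postgres app's 5432.
  fly proxy 15432:5432 -a my-db-app &
  local proxy_pid=$!
  sleep 2  # crude wait for the tunnel to come up

  # Dump the remote database through the tunnel.
  pg_dump "postgres://postgres:${FLY_PG_PASSWORD}@localhost:15432/mydb" > /tmp/mydb.sql

  # Bring up the local stack and pump the data in.
  docker compose up -d db
  psql "postgres://postgres:postgres@localhost:5432/mydb" < /tmp/mydb.sql

  kill "$proxy_pid"
}
```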


Sounds awesome/slick. If you know of any examples that can be shared would love to see them. My use of the fly CLI is rudimentary at best, and I don't use scripts at all, just (mostly basic) commands.


With all the talk of PostgreSQL, I'm curious if anyone has tried the offerings Fly.io has been pushing with SQLite (Litestream & LiteFS) in any serious capacity, and can comment on the experience.

Wondering whether they would be a viable alternative to Postgres in the typical "Database + Docker Container" architecture.


Litestream & LiteFS author here. Tailscale has been using Litestream in production for a while and blogged about it in April 2022[1]. I've had a number of folks contact me out-of-band and tell me they're using it in production at a surprising scale.

As for LiteFS, it's still early beta stage so I don't know of many people using it in a production capacity. I would expect it to be another six months or year before confidence builds enough to see more regular usage.

[1]: https://tailscale.com/blog/database-for-2022/


> The main motivation to migrate was cost.

Is Fly.io really that much cheaper for hosting a CDN/Postgres/Docker container?


I moved a service that was costing me $300+/mo to Fly and this month it was like $15. Some months I fall under the minimum $5 billing threshold and it is just free.

They are the only provider that has nailed the killer feature: turn off shit while it isn't in use.


This. We have a similar situation with Azure services being gradually migrated to Fly. Stuff that cost us $200+ to do on Azure is just like $15 on Fly.


Yeah, I sometimes wonder if they bill me correctly. Their price is not that cheap but at the end of the day, the cost is lower than expected.


We saved about 50% of our monthly bill migrating from AWS to Fly.io. While Fly.io isn't cheap, it's certainly cheaper than AWS from my experience.


I saved about 25% from Heroku, but fly is much less reliable. I've had TLS certs expire without warning and redis has timeouts


I tested websockets on Fly about 2 years ago, and they seemed to have a kinda weird characteristic. It's hard to explain because I couldn't dig into it. It's like throttling + batch scheduling somehow (I don't know what it actually does though)

I tested Fly, DO App Platform, Render. Ended up with DO (but it's meh, can't mount a disk volume, can't deploy a specific git SHA)

I'm going to [stress]test their wss:// again this year; I'm thinking of finally moving a commercial one to Fly (I already have hobby apps there)


Did we diagnose the TLS certs expiring? This is most often something like a CAA record that fails invisibly. We're not very good at notifying you of these kinds of issues, but certs should always renew when the config remains good.

People who proxy through cloudflare hit this a lot, unfortunately.


CF is my DNS. Not sure if I had the proxy enabled. I was able to re-configure everything and it started working. I made a reminder to check around the 65 day mark to see if it successfully renews.

I suspect what may have caused the issue is I have multiple fly hosts on subdomains, but each subdomain had a wildcard certificate against the root (*.example.com).


> redis has timeouts

I see closed connections on low traffic redis connections - for now a dumb catch-and-reconnect is doing ok
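In case it's useful, here's a minimal sketch of that catch-and-reconnect pattern in Python. The wrapper class, retry count, and backoff are made up for illustration, not any real redis client API; a redis-py connection drop surfaces as a `ConnectionError` subclass, which is what's caught here:

```python
import time

class ReconnectingClient:
    """Hypothetical wrapper: rebuild the connection when a call fails."""

    def __init__(self, connect_fn, retries=3, backoff=0.5):
        self._connect_fn = connect_fn
        self._retries = retries
        self._backoff = backoff
        self._conn = connect_fn()

    def call(self, op, *args):
        """Run op(conn, *args), reconnecting once per failure up to `retries`."""
        for attempt in range(self._retries):
            try:
                return op(self._conn, *args)
            except ConnectionError:
                time.sleep(self._backoff * (attempt + 1))
                self._conn = self._connect_fn()  # rebuild the dropped connection
        return op(self._conn, *args)  # final attempt; let the error propagate
```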


This was the problem I was experiencing:

https://community.fly.io/t/redis-socketclosedunexpectedlyerr...

Taking this thread's advice, I started self-hosting on flyio and have only seen 1 timeout in ~2 months.


do they claim to warn about TLS certs and pending expiration? (i haven't found that in docs anyway)


With Heroku, I never had to think about expiring certificates. With FlyIO, my app was down for 4 hours and I only knew about it because of 3rd party uptime monitoring.

Here are a few support issues with this problem:

https://community.fly.io/t/ssl-certificate-did-not-renew-aut...

https://community.fly.io/t/certificate-expired/9143

https://community.fly.io/t/ssl-cert-expired-and-did-not-rene...


Since the fly folks are here: I'd love the ability to trigger jobs as I can with aws batch. The ability to do `fly run -a journalize-batch --command="python3 upload.py"` is something I need right now and am in the process of setting up aws batch to do it.


You can probably hack together with machines, though we're not a great fit for big bursts of compute. "fly machine run" is fun.


I recently migrated some services from DigitalOcean to Fly and it was quite pleasant. My only issue is that routing to Fly from Southern California is awful.

My friend and I have different ISPs and yet we’re both routed to the east coast instead of the much closer Los Angeles edge server.

So our requests end up going to the east coast(edge), then back to the west coast (app), back to the east coast(edge), and then back to west coast (origin).

The latency adds up and makes requests to a nearby app almost as slow as to one hosted across an ocean.

I’m worried that other users will hit this and make hosting on Fly terrible for apps that require lower latency :/


This is uncommon. When did it happen? We had some cross-country routing issues a few weeks ago. We can generally fix these if you don't mind sending us traceroutes.


Here’s the forum thread I made about the issue. https://community.fly.io/t/how-is-edge-routing-determined/96...

Traces and ISP involved are in there. Thanks for looking into it!

Edit: I just double checked and it's still the same route to the east coast as posted in the thread


Ah, thanks. We'll chase it down. We got a little behind over the holidays.


I really do appreciate it. Is there any way Fly could detect this? I know geo IP isn’t perfect but it seems like it could help you track this stuff by allowing you to compare edge location to the users location.


There probably is a good way to do this. Do you have thoughts on things we should do here? We have a code footprint in every region (and also in like 6 off-platform regions on other providers) for doing this kind of monitoring, and we're probably not putting it to the most effective use we can. Ideas would be most welcome.


This idea might be a bit too naive but it seems like it could help

1) Record the IP address and edge of connecting clients
2) Use an IP API to determine the rough location and ISP of clients based on their IP
3) Spatially query the edge closest to each client and graph/log serious mismatches (>x miles off)
4) Cast the spells required to route the offending ISP better
5) See if the data improves :)

For example, the IAD edge would record that I connected from San Diego via Cox Communications. My closest edge geographically is LAX so that's ~2500 miles of excess travel!


This is cool, I'd love to know more about how you generally fix issues like these?


I usually just go complain to network folks and they figure it out. I'm good at complaining!

Routing issues like this are typically the result of some weird network provider politics. The fix is typically to change how we announce IPs to force a route to do something we want. It's a dark art.


Awesome, thanks for sharing. I am also good at complaining! We might make good friends.


What ISP's are involved here?


Cox and Spectrum. They route slightly different but both to the east coast. I linked the forum thread above.


Nice post.

I also documented steps how to run Rails application if someone needs it: https://businessclasskit.com/docs/how-to-deploy-rails-sideki...


So it seems the 'machine' is the equivalent of AWS's EC2? I think I would only save ~$5 by not having to have HAProxy (cos AWS's LB is expensive), but I would need to manage PG myself, which seems like a hassle for a hobby project.


> Fly.io is not a support-first company. If this bothers you, then Fly.io is probably not for you.

> When emailing support, and email is the only option, it can take from hours to days to get a response. Sometimes they don't follow up. It does not feel like their level of support is on par with other cloud providers.

This enough to keep me on AWS. You can say a lot about AWS but their support is great


Responsive at least. They’re very quick to inform you that x is a known issue with their service, and the request has been dispatched to the dev team.


Yeah, I'm ok with "hours to days" of mysterious downtime for some personal projects. But for my day job? Deal killer for sure.


I found debugging setup errors on fly to be incredibly frustrating

You can't even ssh until the container started and they have some requirements on hostnames (0.0.0.0)

At least a few of their examples are broken.

The documentation could be better but it's pretty good

Is it better than AWS? Yes, of course; you'd need to spend way more to get a DX as bad as AWS's.

Setting up a VPS is a smoother experience than fly, though.

Once it's running, it's lovely and reminds me of heroku.


Help me understand what you'd be SSH'ing into before your VM boots up? (I'm responsible for most of our SSH goop, so if there's an improvement to be made, I'm happy to make it).

A reminder: we don't run "containers". We take containers, unpack them, and transform them into VMs. There's no host OS for you to access.


Caveats: 1) this may not be what grandparent was trying to do and 2) I have limited experience debugging "regular" Docker, let alone firecracker or whatever Fly.io does.

With that said, I successfully deployed a simple Rust/Actix/Sqlite app with a Dockerfile to Fly.io. I then thought I'd try out litestream. For reasons I'm still not sure about, having `ENTRYPOINT ["litestream replicate … -exec myapp"]` resulted in immediate kernel panics.[1]

As I was debugging, my instinct was to want to `fly ssh console` into a running container to see if running the same command from the shell produced any clues for further debugging. Though I understand this is the wrong mental model, the thought was something like "well, I know everything else works, I just need Fly to ignore this failing command so I can have a minute to poke around." To do this, I ended up just removing litestream from the ENTRYPOINT so the deploy would succeed, then I could SSH in and play around at the shell to see what was going on.

Again, I have no idea whether this is the same sort of problem the other person was having, but for my case, what would be helpful is probably not changing how `fly ssh console` behaves, but perhaps some documentation of suggested debugging techniques in case your app is failing to start.

[1] Putting the exact same command into a "start.sh" and making that the ENTRYPOINT worked fine so that's what I ended up doing.


One of my biggest UX nits with Fly (I have no excuses, I have all the access I need to go fix this myself) is that we "kernel panic" when your entrypoint command fails. Of course, our kernel is not really panicking --- we've just run out of things for our `init` to manage, so it exits, and when pid 1 exits, so does the kernel. But you get the terrifying stack dump.

We can clean this up, so that you get a clearer, simpler error ("your entrypoint exited, here's the exit code, there's nothing else for this VM to do so it's exiting, have a nice day"), and it's been on the docket for months. We'll get it done!

We could conceivably add a flag for our `init` to hang around waiting for you to SSH in after your entrypoint exits. But that's clunky and complicated. Usually, you want your kernel to exit when your entrypoint fails, so that your service restarts! What you should do instead is push a container that has enough process supervision to hang around itself. Here's a doc:

https://fly.io/docs/app-guides/multiple-processes/


That all makes sense. I think context matters. Yes, when I have a service that's been up and running for some time, if that service fails, I want my service to restart.

But when I'm trying to get a service running for the first time and I'm not sure that I have the right command in the entrypoint, the right arguments to that command, or the right supporting files in place, or the right libraries installed, or the right file permissions, …, well, then I don't want things to just blindly restart, I want a handle and some information so I can figure out why it isn't working.

ETA: I recognize that your link to docs about running a supervisor addresses this problem. For me this raises some interesting questions. Like, I understand why Ben would implement `litestream exec` but maybe it would be better to steer users to a proper supervisor? Separately, what if it's the supervisor that's failing? Now I'm back to seeing kernel panics and not having error messages or a shell.


Usually, when I'm debugging a container, I start with a `tail -f /dev/null` entrypoint, or something like that, and then just shell into it to run the real entrypoint to see if it's working.
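For a Dockerfile-based app, that trick looks roughly like this (a sketch; the real entrypoint path is hypothetical):

```dockerfile
# Swap the real entrypoint for a no-op while debugging, redeploy, then
# `fly ssh console` in and run the real command by hand.
# ENTRYPOINT ["/app/myapp"]              # the real entrypoint, disabled for now
ENTRYPOINT ["tail", "-f", "/dev/null"]   # keeps pid 1 alive so the VM stays up
```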


That's probably closer to sysadmin magic than the average developer's way of thinking when debugging.


> We could conceivably add a flag for our `init` to hang around waiting for you to SSH in after your entrypoint exits. But that's clunky and complicated

This is a common thing in CI platforms, and the way they usually expose this is "run tests with SSH enabled", and they keep it open for 30 minutes/2 hours/whatever until a session closes.

So if I have some app failing, being able to run `fly restart --ssh-debug`, having that first just sit around waiting for the app to boot, and then dropping into ssh would be a very helpful piece of UX. The main thing is cleanup, but y'all charge for compute! You can be pretty loosey-goosey on that one honestly.


Maybe he means/wants some kind of “console” access for his “instance”? Seems crazy, as I'm sure it takes less than a second to boot, and if there is a problem booting up it's probably not going to be a problem that a user/customer would be able to solve at the console.


> You can't even ssh until the container started

In fairness, I'm struggling to think of a way they could get you to ssh to a container that hasn't started yet

> and they have some requirements on hostnames (0.0.0.0)

Can you clarify?


if the author says it doesn't take more than few hours + DX is really better, who are we to judge :)


For me, Fly.io is like a seminal step forward to an imaginary OS of the future: standardized, auto-clustered, just works, dead simple to grasp. Kind of Windows 95 of its time.


Since the fly folks are here, do you plan to support monorepo apps in a single fly.toml, i.e. define all services in a single fly.toml like docker compose or Render's blueprints?

Also, a way to reuse the definition but run a different command. For example, for a Rails app, the web worker (puma) and job worker (sidekiq) differ only in the command run. In Fly, I would have to duplicate those. If there is an easier way, do let us know.


The second case has experimental support already: https://fly.io/docs/reference/configuration/#the-processes-s...

The first, we're not completely sure about yet. Probably someday, though.


I would love to try fly.io but unfortunately it doesn't support GPU workloads yet. Waiting eagerly!


Isn’t Fly.io hosted on AWS (i.e., they don’t have their own hardware)?

Just like Heroku.


We have our own hardware.


That’s awesome to hear!

Thank you btw for all your funding/support of the Erlang/Elixir ecosystem.


Stay cool, Fly. You guys will rule the world!


No, I don't think they are. The things they do with Anycast IPs I don't think is possible on AWS.



