Fly is great. I know the Fly team has an aversion to it but I really just wish they would hire some database folks and take on managed Postgres. I'm already running 8+ apps on there and it would be the peace of mind needed to move the rest across… one day.
The lack of “managed” Postgres is the only reason we haven’t moved 100% to fly.
The idea of having to do my own upgrades by updating the docker container scares me.
I have literally never run a stateful service off docker and don’t see the point, thanks to AWS.
RDS has been extremely stable and boring. Maybe it's put together with similar duct tape under the hood, but it's been incredibly solid for nearly a decade across multiple database stacks.
That latency was the deal breaker for me. Kind of defeats having your app running on metal near your customers if they're in Dallas and the DB's in Ohio
Servers sitting less than 1km from each other tend to have lower latency between them than servers that could be anywhere on the public internet.
Ah because I am interpreting it more at the logical level and about them venturing out into unknown territories (the packets are traversing routers I do not control), rather than just latency. Since others have mentioned that latencies are very low if you locate in the same region.
But if you have a zero-trust architecture and everything communicates over wireguard, then technically public or private won't matter right?
Not saying public/private IPs don't matter – since almost all the servers would still only have private IPs.
With IPv6, you can have all public IPs (the same as early days of IPv4). However, if you have apps that are tightly coupled and often with strict latency requirements such as Web <-> DB, then it's better to communicate only over private links. You can control latency and you have better reaction time when something breaks down. That is in addition to security concerns.
I remember a failure of one transatlantic line that was used for one direction of packets we were sending between EU and US. I spent a night on a phone trying to convince AWS and our other datacenter operator to work around the problem by changing their routing and to push the backbone operator to fix the problem - we didn't have any relationship with the backbone of course.
You don't want such disruption to happen in a core part of your app. These disruptions can also be intermittent and "random", and you have no way to fix them.
I wonder if we're moving to an unbundling phase of the cloud.
The only reason I think we might not be is that the selling point is less vendor lock-in, not something managed significantly better. For the most part, the major cloud providers have mature, feature-rich product offerings for common use cases like DBs, distributed queues, running services, load balancers, etc.
Are you guys looking at a partner program where a lot of the managed postgres companies could also provide managed postgres but on top of your infra? (like they do with GCP/AWS/etc)
This is an advantage of the big public clouds. Any SaaS you could want is deployed into the same region as your application, so it's essentially co-located and on the same playing field as one of the native public cloud services.
With Fly.io (at least today), this isn't as straightforward given your favorite database is likely not deployed there.
Even at minimal scale, one is going to have quite a few app instances all over the world sooner or later. In such a setting, a notion of regional proximity of any kind quickly becomes irrelevant.
Higher database round-trip times can be mitigated by using database stored procedures instead of making a multitude of tiny SQL queries.
Not everything can be represented by a single database query. Often multiple data sources get mixed together, which breaks the one-query-for-all-results pattern and thus encourages co-locating the database near the tier making the DB requests.
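The impact of chatty query patterns is easy to see with back-of-envelope numbers. This is just a sketch with assumed round-trip times (40 ms cross-region, 1 ms in-region), not measurements from any particular provider:

```python
# Total DB wait time per web request is (round trips) x (RTT).
# The RTT figures below are illustrative assumptions.
def request_latency(round_trips: int, rtt_ms: float) -> float:
    """Total time a request spends waiting on the database, in ms."""
    return round_trips * rtt_ms

print(request_latency(20, 40.0))  # 20 small queries, cross-region: 800.0 ms
print(request_latency(1, 40.0))   # one stored-proc call, cross-region: 40.0 ms
print(request_latency(20, 1.0))   # 20 small queries, co-located DB: 20.0 ms
```

Batching into one call helps a lot cross-region, but co-location helps the chatty pattern even more, which is the point both comments are circling.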
Not really. I see fly competing with Workers for smaller services, and with KV and D1, managed edge DBs are so easy it’s hard to live without them. Very different use case from a traditional managed DB - because it’s either all scaling and running in fly or it isn’t.
I'm not completely averse (we've done this before!), it's just not how I think the world should work. If we hold out, and the future Fly cloud has a better-than-RDS Postgres option, it will be worth it. If we build our own managed Postgres, no one better is going to bring their stuff to our platform. Same with MySQL.
I like their strategy of outsourcing what they’re not best at to the best. They partnered with upstash, and now they’ve got a new feature almost immediately. Sure there’s a lot of vetting, but wrapping products is common.
I’d be happy if they can wrap Neon’s serverless postgres service into their networks somehow.
We are working on becoming a first-party service on several major platforms now. We will be happy to provide this for Fly. Kurt and I have been chatting.
Fly wants cross region replication which is coming soon. Once it's there we can integrate.
With that, Neon is still on AWS and can be close to Fly, but not run on Fly servers. It's relatively straightforward for us to run on Fly, but the S3 part will still be Amazon.
The best option I could find was using DigitalOcean Managed Databases. The cheapest plan costs $15/month, but you can host multiple databases with good insights and backups. You can choose where you host it and place it close to the region where your Fly.io apps are, with low latency.
Only caveat is there isn't an easy way to automatically update the database's IP whitelist to support the Fly builder and deployed app(s).
Nowhere do I see mentioned how much was actually saved in dollar figures. What are we talking here—thousands, hundreds? Similarly, I'd be interested in knowing the engineering effort involved in the migration.
As a software consultant focusing on designing, building, and deploying software on AWS, more often than not, the infrastructure cost issues I've seen have less to do with the underlying infrastructure provider — AWS in this case — and more to do with the application itself that's being deployed. Most recently, we were able to shave off 50% of a client's bill that was using Lambda for compute. The issue? Several, including sleep timers (on a service where you pay by the millisecond!) as well as pathological code (i.e. consume message from queue, enqueue same message, rinse and repeat).
Yes, you can save money by transitioning to another provider. But I'd start with reviewing the underlying code and architecture first.
The biggest $ cost I'd have on my radar WRT AWS vs Fly is complexity and need-to-know of employee time to do any given thing. Fly is way easier to navigate and use than AWS and its labyrinth of Cloud Scale™ horrors. The trade-off being you can do more given things with AWS.
You think it's inevitable? I think it's unlikely; they're probably going after a different kind of target. Those "edge case" horrors seem unnecessary, like permission hell and pricing-model hell (and too many instance sizes).
Yeah, this. AWS enables you to do anything you can think of, at any scale and complexity, given enough time spent reading their documentation and crafting terraform configurations.
Fly makes it really easy to do the 80% of things that most small/medium operations need to just get done now.
* deploying (re)builds your container but doesn't even tag it locally, so if you just built one, it will build the exact same one again; on the other hand, you're left guessing which hash is the container the tool just built.
* if you start with your own Dockerfile the mandatory options in fly.toml are not described overly well, so a bit of fiddling
* recently I redeployed, and without any real change (just a new code build) it wouldn't deploy; I had to remove some of the fly.toml settings I had guessed at in the summer just to make it deploy now - no changelog anywhere
* if your app is coming close to some assumed limit on services.concurrency (which is not well documented), again you need to guess and redeploy until it works
Maybe I've run into an edge case with a JVM app where 256MB of RAM is tight, but my problem is more that all the feedback I get from the tooling is a little meh and doesn't make me confident to ever try production workloads here. Not complaining about the free service though and the dashboard/metrics stuff seems nice.
Mine's a Clojure app and I'm running it via `java -Xmx180m -Xms180m` and my fly.toml has a kill_timeout of 15 and in services.tcp_checks I now have interval 25000 and timeout 9000 - but as I wrote, it's a very basic app, anything JVM on 256MB has always (even years before fly.io) been hit or miss, sadly. Good luck!
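For reference, the relevant parts of my fly.toml look roughly like this. This is just a sketch of the values described above; the port is a placeholder, and the full schema is in the Fly.io docs:

```toml
# Generous timeouts so slow JVM startup on 256MB doesn't fail health checks.
kill_timeout = 15

[[services]]
  internal_port = 8080   # hypothetical app port

  [[services.tcp_checks]]
    interval = 25000     # ms
    timeout  = 9000      # ms
```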
The article mentions this, but not strongly enough in my opinion. With Fly you have no managed PostgreSQL. In my opinion, it's not really comparable on cost to AWS when Postgres is in the stack.
You do read about interesting hacks where someone will set up RDS in a region that may be single-digit milliseconds away from a Fly region. Then, presumably, you could put PgBouncer on a sort of bastion host that connects to the Fly WireGuard VPN. But obviously, there's no guarantee that the latency will always be that good.
RDS works really well with Fly apps. We often see lower latency between Fly.io and us-east-1 in Ashburn than an app does spanning two availability zones in us-east-1.
But, realistically, we'd like a managed Postgres provider on Fly.io hardware. It's a much better developer experience, we need DBs in every region, and our private networking is pretty dang powerful. I think we're close, but we may need to get a little bigger before we seem relevant to them. We're weirdly closer to managed MySQL than Postgres.
Thanks for this, this is the info I was looking for. I'm thinking of using fly.io for some bigger projects and was considering the possibility of having data on AWS, so this is great to hear.
> But obviously, there’s no guarantee that the latency will always be that good
That guarantee already doesn't exist with RDS. The RDS master is in one AZ, your server may be in another (and if it isn't, well, RDS will fail over to another AZ eventually). The network latency between AWS AZs is usually good, but it can be arbitrarily high, up to full outages between AZs.
This is just a measure of degrees. Your application and/or business already has to handle some degree of network issues (aws network outages), so it's a tradeoff of whether the extra hops increase the chance enough to matter.
Isn't it usually a bad idea to have the database and application server in different data centers? It is / was my understanding that both the database and application server should ideally be on the same rack.
It's really just a matter of latency and bandwidth costs. If you can tolerate both, then keeping them in different data centers is fine. In fact, depending on your architecture, you may have to do this at some point anyway, even within the same cloud provider.
Does Fly let you run unmanaged Postgres easily enough? Or even semi-managed, as easy-to-provision nodes without redundancy where you yourself set up pgbouncer, replication, etc.? For simpler cases it could very well suffice.
It isn't "Managed Postgres," but the differences are minimal. RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."
> RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."
I think this is a terrible oversimplification, and something tells me that you haven't had to deal with a complex database setup from an operations perspective. RDS removes a huge amount of operational overhead (HA, backups, upgrades, and clustering being the first that come to my mind). Between RDS and running a database on virtual machines managed with, let's say, Ansible to provide those four features, I would choose RDS any day of the week.
> you haven't had to deal with a complex database setup from an operations perspective
I've often found it to be the opposite actually.
My experience is with RDS MySQL, and on that, RDS heavily restricts what you can do. Want to do partial replication? Nope. Want to install a database plugin? No access provided to do so.
Used to have a MySQL instance on EC2, but the rest of the team joined the cargo-cult of 'everyone else uses RDS, so it must be good'.
I used to be able to grep replication logs to find problem queries, but RDS doesn't give you access to those. I used to use various I/O and CPU monitoring tools to help pinpoint bottlenecks in queries/performance, but you only get the few metrics RDS gives you (e.g. RDS only gives you aggregate CPU usage, not per-core usage).
Even stuff like killing queries gets annoying - standard MySQL GUIs typically issue a `KILL` statement, but you aren't given permission to execute that. RDS provides a workaround via a stored procedure, but that means you have to break into a console and remember the name of the SP.
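For anyone who hits this, the RDS MySQL workaround looks roughly like the following (the thread ID is whatever `SHOW PROCESSLIST` reports; exact procedure names are in the RDS docs):

```sql
-- A plain KILL is not permitted on RDS MySQL, so use the wrapper procedures.
SHOW FULL PROCESSLIST;                -- find the offending connection's Id
CALL mysql.rds_kill(12345);           -- kill that connection
-- CALL mysql.rds_kill_query(12345);  -- or kill only its running query
```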
Which leads to my next point - I think "managed" is a big misnomer. RDS is nothing like a truly managed DB with a DBA. AWS isn't assigning someone to optimise your tables, or won't help you look at your queries to see what can be done better. If something goes wrong, it's on you to fix it.
IMO, RDS is more like a pre-configured database. It saves you from having to initially configure the database, and saves you from having to set up automated backups, HA etc.
My opinion is that, if all you need is a cookie-cutter solution, RDS is okay. If you need a complex setup, stay away.
In case it needs saying: Fly.io agrees with this! We didn't build our Postgres feature as a statement about the utility of managed Postgres; it's a statement about our size relative to AWS. :P
...or their primary job has being dealing with a complex database setup and hasn't had to juggle that along with intensive application design.
Count me as someone who confidently doesn't know how to run a database and doesn't really care to. At least not at a level where someone would hire me to do it in production.
> RDS is ultimately a solution for people who look in the mirror and confidently say "you don't know how to run a database."
In the same way that a RDBMS is ultimately a solution for people who look in the mirror and confidently say “you don’t know how to directly write to disc while guaranteeing the validity of relational data in spite of concurrent writes, power failures, etc.”
Absolutely spot on comparison. For some people, RDS is the corner stone of their business. For others even a failed database can be rebuilt from logs within hours and it isn't business impacting.
Use the tools that make sense, but don't be afraid to pick the right tool.
I used to be very dba focused ages ago. Aws RDS for postgres I just love. Their RDS products must print money. Yes I could deploy myself. Yes I could use docker in various ways - I use docker for most app deploys. But but but - if you’ve been down the rabbit hole of scaling a database - or backing up, updating, securing etc its a no brainer - and aws let’s you start small
There are several reasons why Oracle has a poor reputation and should be avoided in business:
High licensing fees: Oracle is known for its high licensing fees, which can be very expensive for businesses. This can lead to a financial burden on companies, especially smaller businesses.
Poor customer support: Oracle has a reputation for poor customer support, with many users complaining about long wait times, unhelpful responses, and a lack of follow-through.
Complex software: Oracle's software can be difficult to use and understand, which can lead to delays and frustration for businesses.
Compatibility issues: Oracle's software is not always compatible with other systems and can cause problems with integration.
Poor security: Oracle has had a number of security breaches in the past, which can lead to concerns about data privacy and security.
Overall, Oracle's high costs, poor customer support, complex software, compatibility issues, and security concerns make it a risky choice for businesses.
Moreover, Oracle has a history of suing companies that it believes are using its software without proper licensing or permission. This includes sending out lawyer letters to companies that it believes are infringing on its intellectual property rights. Oracle has been criticized for its aggressive tactics, which some believe are designed to intimidate and bully companies into compliance.
I wrote this answer with the help of ChatGPT. I think the AI did learn Oracle’s bad side pretty well.
Free-tier Oracle sometimes feels like you have a lower priority on the given resources, and network speed is less than on Hetzner, but it has worked fine for a year now.
Wouldn't trust commercial things running on it without a plan B, though.
It's great, I've used it for the last 2 years. Btw, you can have 4 machines for free (2x 1-core x64 with 1GB RAM each + 2x 2-core arm64 with 12GB RAM each); the minimum hard disk is 50GB, so 4 machines fit in the free 200GB. Free network traffic up to 10TB/month, 2 free 20GB Oracle databases, 20GB of free object storage.
Oh, and FreeBSD runs on it (I think it's under partner images).
Last time I checked Ampere cores were not available in the region I wanted.
Before jumping in, first check whether the Oracle image for ARM is available in the region. This doesn't guarantee you can get cores, but without the image there are no cores that can boot it.
From your post you say you needed a simple service plus a Postgres instance. I think you can have that for around $50 on AWS using a small EC2 and Aurora Serverless v2 (min $40/mo)
I believe GP is commenting on Fly's integration. Fly comes with built-in Prometheus scraping and prepopulated Grafana dashboards for VM resources and some connection/request metrics. I believe the value is that if you simply expose a Prometheus endpoint, they'll scrape it and you can use their Grafana (or hook your own into their Prometheus).
Nothing special vs vanilla prom/grafana, but seamless “no-ops” integration vs DIY.
The Terraform provider was a one off project from the community. We haven't really focused on it in earnest yet, unfortunately. We ended up sponsoring it, but until we get Fly apps off Nomad I don't think we'll have the best terraform story.
Though, it's a "partner" provider and it's in your official github account, plus there's no notices saying it's not officially supported... it would probably help people if that were a bit clearer.
Yep, this is a good note. Things got fuzzy, because we were so impressed by the person who took it upon themselves to randomly build a Terraform provider for us that we hired them.
Sorry, not to use this as a support forum, but what exactly does it need? Just some provider-side bug fixes? Or are there Fly-side changes that need to be made before it will work? If it's something small I wouldn't mind poking at it.
Just provider bug fixes / developments. The APIs it uses are pretty stable. The particular errors in those issues look like either badly formed API calls, or something weird with auth.
It also probably needs an update to use api.machines.dev. It is currently trying to connect over wireguard to hit our private API, because we didn't have a public API endpoint when it was built. This is probably brittle. :)
Thanks! Is there any documentation on the stable API? The link I posted has some normal REST stuff for machines, but nothing on other resources. I can't find any central API reference/etc in your docs.
Looking at the code, it uses GraphQL? I can't find any documentation on your GraphQL API either though, aside from a web editor that doesn't appear to be linked anywhere official.
As fly apps v2 becomes more stable the docs will continue to improve. Watch this space :). As for the graphql, it's a bit of a thorny situation. The graphql exists primarily to service our needs internally and as such has no guaranteed contract. Unfortunately, however, for some things that's the only API that exists. The autogenerated docs in the playground are pretty good actually though. That said, we are working on improving our rest api so more things will be available there.
No, ordinary Fly apps (what we call v1 apps) are managed by nomad. They don't scale to zero. You can create a machine based app and it'll scale to zero just fine, but you lose most of the magic of `fly deploy` right now.
Pretty tiny cloud costs either way, probably safe to assume they’re a small, early stage startup. But even still, it was likely a bad call to spend significant engineering effort on saving $110/month, which is a rounding error even for small startups. Hard to imagine that time couldn’t have been better spent on things that grow the business, like new features and bug fixes.
Definitely tiny but we're just overall happier with Fly. The migration in total probably cost us a weekend of work all said and done. We get a lot for free with Fly (observability, multi-region, etc.) so all in all a win.
I would read it as a 60% saving. That's a very good one, considering that this is a recurring cost, not a one-time one, and it's proportional to their sales. I.e., they'll now be able to provide their clients twice the amount of content for the same price. At least in theory.
My thought was more about timing - there's so much that's URGENT for a young startup, is cost cutting really urgent, something you have to do NOW, when the current savings are just $110/month? Part of what allows startups to survive is ruthless focus on the work that's most important today, and saving $110/month seems unlikely to be that.
However, the blog post author responded, noting that this migration took only 1 weekend, and that they like other aspects of Fly besides the cost. So fair enough; 1 weekend is a small investment, less than I expected. I would have guessed this took 2+ weeks.
I've migrated several services from GCP to Fly.io, and I've been very happy with Fly.
All of my apps are low-volume hobbyist web apps that run on a single machine, so I don't do any Terraform, k8s, or Postgres, so my use case is a little simpler than OP.
My biggest complaint has been with outages,[0, 1, 2] but that's gotten better. The original architecture meant that larger servers could evict other users' running servers, which they admitted was a bad idea.[2] I'm not sure if they've fixed that since, but I haven't seen it happen in a few months.
>The container logging solution provided by Fly.io is basic. It's easy to view logs with the Fly.io CLI and via the Fly.io dashboard. However, there's only a small window of logs that are kept, forcing you to create a remote logging solution either internal to your application or via a separate Fly.io application that ships logs to an external service. This is a piece of operational overhead that I'd like to see removed.
I very much agree. This is a big feature gap I feel when using Fly as well.
>Fly.io is not a support-first company. If this bothers you, then Fly.io is probably not for you.
>When emailing support, and email is the only option, it can take from hours to days to get a response. Sometimes they don't follow up. It does not feel like their level of support is on par with other cloud providers.
This hasn't been my experience. I go through the forum rather than email, but I typically see responses within hours, even on the weekends.
By contrast, when I reported bugs to GCP through their designated support channels, there was multi-day latency, and they'd just keep dismissing my report.[3]
You probably had a different community experience than recent folks. We've grown like 4x in the last 3 months and haven't kept up with community as well as I'd like (because hiring). And over holidays, we've had a difficult time with email support as well. It'll get better, but it's not great right now!
To avoid outages, just scale up the number of app instances to like 2 or 3 and choose a diverse geography for them.
Outages are not something specific to Fly. For example, if you have an app on Azure with only one instance then... well, you asked for it to go down from time to time.
Maybe not a big deal for hobby projects, but for business critical stuff one should always consider deploying several instances of the same app when possible.
The key point is not to have any persistent state in the app. The app may use a database, that's OK, but the app itself should not store any persistent state on a hard drive.
Once that simple requirement is met, the scaling becomes easy.
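Once the app is stateless, the scale-out itself is a couple of CLI commands on Fly. This is a sketch; the region codes are examples, and exact flags are in the flyctl docs:

```shell
# Allow instances in a few geographically diverse regions,
# then run three instances spread across them.
fly regions add lax ord ams
fly scale count 3
```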
Biggest negative so far: no easy way to reset the database [1]. I'd typically like to blow it away and re-create it frequently as I tinker with it e.g. adding/removing/renaming a column or changing a data type. Those small changes now require migrations, which are a fair bit more work (but not a huge problem tbh).
Biggest positive: ability to select a zone meant response time for Australian users was very fast compared to heroku; making a big difference to UX.
This is stuff where I highly recommend writing up helper scripts. I have one to pull in stuff from a fly Postgres DB, spin up my local docker compose setup, and then pump the data into there.
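A minimal sketch of such a script, where the app name, database name, ports, and compose service are all placeholders for whatever your setup uses:

```shell
#!/bin/sh
set -eu
# Forward the Fly Postgres over WireGuard to a local port, dump it,
# then load the dump into a local docker compose Postgres.
fly proxy 15432:5432 -a my-postgres-app &
PROXY_PID=$!
sleep 3   # crude wait for the tunnel to come up
pg_dump -h localhost -p 15432 -U postgres mydb > /tmp/mydb.sql
kill "$PROXY_PID"

docker compose up -d db
psql -h localhost -p 5432 -U postgres -d mydb -f /tmp/mydb.sql
```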
I think that's the real niceness here, a CLI that is pretty straightforward (not perfect by any means! but I can still wrap my head around it)
Sounds awesome/slick. If you know of any examples that can be shared would love to see them. My use of the fly CLI is rudimentary at best, and I don't use scripts at all, just (mostly basic) commands.
With all the talk of PostgreSQL, I'm curious if anyone has tried the offerings Fly.io has been pushing with SQLite (Litestream & LiteFS) in any serious capacity, and can comment on the experience.
Wondering whether they would be a viable alternative to Postgres in the typical "Database + Docker Container" architecture.
Litestream & LiteFS author here. Tailscale has been using Litestream in production for a while and blogged about it in April 2022[1]. I've had a number of folks contact me out-of-band and tell me they're using in production at a surprising scale.
As for LiteFS, it's still early beta stage so I don't know of many people using it in a production capacity. I would expect it to be another six months or year before confidence builds enough to see more regular usage.
I moved a service that was costing me $300+/mo to Fly and this month it was like $15. Some months I fall under the minimum $5 billing threshold and it is just free.
They are the only provider that has nailed the killer feature: turn off shit while it isn't in use.
This. We have a similar situation with Azure services being gradually migrated to Fly. Stuff that cost us $200+ to do on Azure is just like $15 on Fly.
I tested websockets on Fly about 2 years ago, and they seemed to have kinda weird characteristics. It's hard to explain because I couldn't dig into it. It's like throttling + batch scheduling somehow (I don't know what it actually does, though).
I tested Fly, DO App Platform, and Render. Ended up with DO (but it's meh: can't mount a disk volume, can't deploy a specific git SHA).
I'm going to [stress] test their wss:// again this year; I'm thinking of finally moving a commercial app to Fly (I already have hobby apps there).
Did we diagnose the TLS certs expiring? This is most often something like a CAA record that fails invisibly. We're not very good at notifying you of these kinds of issues, but certs should always renew when the config remains good.
People who proxy through cloudflare hit this a lot, unfortunately.
CF is my DNS. Not sure if I had the proxy enabled. I was able to re-configure everything and it started working. I made a reminder to check around the 65 day mark to see if it successfully renews.
I suspect what may have caused the issue is I have multiple fly hosts on subdomains, but each subdomain had a wildcard certificate against the root (*.example.com).
With Heroku, I never had to think about expiring certificates. With Fly.io, my app was down for 4 hours and I only knew about it because of 3rd-party uptime monitoring.
Since the fly folks are here: I'd love the ability to trigger jobs as I can with aws batch. The ability to do `fly run -a journalize-batch --command="python3 upload.py"` is something I need right now and am in the process of setting up aws batch to do it.
I recently migrated some services from DigitalOcean to Fly and it was quite pleasant. My only issue is that routing to Fly from Southern California is awful.
My friend and I have different ISPs and yet we’re both routed to the east coast instead of the much closer Los Angeles edge server.
So our requests end up going to the east coast(edge), then back to the west coast (app), back to the east coast(edge), and then back to west coast (origin).
The latency adds up and makes requests to a nearby app almost as slow as to one hosted across an ocean.
I’m worried that other users will hit this and make hosting on Fly terrible for apps that require lower latency :/
This is uncommon. When did it happen? We had some cross-country routing issues a few weeks ago. We can generally fix these if you don't mind sending us traceroutes.
I really do appreciate it. Is there any way Fly could detect this? I know geo IP isn’t perfect but it seems like it could help you track this stuff by allowing you to compare edge location to the users location.
There probably is a good way to do this. Do you have thoughts on things we should do here? We have a code footprint in every region (and also in like 6 off-platform regions on other providers) for doing this kind of monitoring, and we're probably not putting it to the most effective use we can. Ideas would be most welcome.
This idea might be a bit too naive but it seems like it could help
1) Record the IP address and edge of connecting clients
2) Use an IP API to determine the rough location and ISP of clients based on their IP
3) Spatially query the edge closest to each client and graph/log serious mismatches (>x miles off)
4) Cast the spells required to route offending ISP better
5) See if the data improves :)
For example, the IAD edge would record that I connected from San Diego via Cox Communications. My closest edge geographically is LAX so that's ~2500 miles of excess travel!
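A sketch of step 3 in Python: compute the excess distance between the edge a client actually hit and its geographically nearest edge, and flag large mismatches. The edge coordinates and client location here are illustrative, and real geo-IP data would be fuzzier:

```python
import math

# Illustrative (lat, lon) coordinates for two edges.
EDGES = {"iad": (38.95, -77.45), "lax": (33.94, -118.41)}

def haversine_miles(a, b):
    """Great-circle distance between two (lat, lon) points in miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * math.asin(math.sqrt(h))

def mismatch_miles(client_latlon, chosen_edge):
    """Excess distance vs. the nearest edge; large values suggest bad routing."""
    nearest = min(EDGES.values(), key=lambda e: haversine_miles(client_latlon, e))
    return (haversine_miles(client_latlon, EDGES[chosen_edge])
            - haversine_miles(client_latlon, nearest))

san_diego = (32.72, -117.16)
# A San Diego client landing on IAD is roughly 2100 miles worse than LAX.
print(round(mismatch_miles(san_diego, "iad")))
```

Aggregating this per ISP (step 4's input) would show which networks need the routing spells cast.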
I usually just go complain to network folks and they figure it out. I'm good at complaining!
Routing issues like this are typically the result of some weird network-provider politics. The fix is typically to change how we announce IPs to force a route to do what we want. It's a dark art.
So it seems the 'machine' is the equivalent of AWS's EC2? I think I would only save ~$5 by not having to run HAProxy (since the AWS LB is expensive), but I would need to manage PG myself, which seems like a hassle for a hobby project.
> Fly.io is not a support-first company. If this bothers you, then Fly.io is probably not for you.
> When emailing support, and email is the only option, it can take from hours to days to get a response. Sometimes they don't follow up. It does not feel like their level of support is on par with other cloud providers.
This enough to keep me on AWS. You can say a lot about AWS but their support is great
Help me understand what you'd be SSH'ing into before your VM boots up? (I'm responsible for most of our SSH goop, so if there's an improvement to be made, I'm happy to make it).
A reminder: we don't run "containers". We take containers, unpack them, and transform them into VMs. There's no host OS for you to access.
Caveats: 1) this may not be what grandparent was trying to do and 2) I have limited experience debugging "regular" Docker, let alone firecracker or whatever Fly.io does.
With that said, I successfully deployed a simple Rust/Actix/SQLite app with a Dockerfile to Fly.io. I then thought I'd try out Litestream. For reasons I'm still not sure about, having `ENTRYPOINT ["litestream replicate … -exec myapp"]` resulted in immediate kernel panics.[1]
As I was debugging, my instinct was to want to `fly ssh console` into a running container to see if running the same command from the shell produced any clues for further debugging. Though I understand this is the wrong mental model, the thought was something like "well, I know everything else works, I just need Fly to ignore this failing command so I can have a minute to poke around." To do this, I ended up just removing litestream from the ENTRYPOINT so the deploy would succeed, then I could SSH in and play around at the shell to see what was going on.
Again, I have no idea whether this is the same sort of problem the other person was having, but for my case, what would be helpful is probably not changing how `fly ssh console` behaves, but perhaps some documentation of suggested debugging techniques in case your app is failing to start.
[1] Putting the exact same command into a "start.sh" and making that the ENTRYPOINT worked fine so that's what I ended up doing.
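My guess at the cause, for what it's worth (an assumption based on the quoting shown, not something I confirmed from the panic itself): the JSON exec form of `ENTRYPOINT` does no word-splitting, so the whole string becomes a single argv element and the kernel is asked to run a binary literally named `litestream replicate … -exec myapp`. That would also explain why the `start.sh` wrapper fixed it — the shell splits the words. Simulated with plain sh:

```shell
# Exec form: the quoted string is one argv element, so there is no such
# file to execute and the process dies immediately (exit 127).
sh -c 'exec "echo hello"' 2>/dev/null || echo "exit status $?"
# Shell form / wrapper script: the shell word-splits, so this runs echo.
sh -c 'exec echo hello'
```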
One of my biggest UX nits with Fly (I have no excuses, I have all the access I need to go fix this myself) is that we "kernel panic" when your entrypoint command fails. Of course, our kernel is not really panicking --- we've just run out of things for our `init` to manage, so it exits, and when pid 1 exits, so does the kernel. But you get the terrifying stack dump.
We can clean this up, so that you get a clearer, simpler error ("your entrypoint exited, here's the exit code, there's nothing else for this VM to do so it's exiting, have a nice day"), and it's been on the docket for months. We'll get it done!
We could conceivably add a flag for our `init` to hang around waiting for you to SSH in after your entrypoint exits. But that's clunky and complicated. Usually, you want your kernel to exit when your entrypoint fails, so that your service restarts! What you should do instead is push a container that has enough process supervision to hang around itself. Here's a doc:
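For concreteness, here is a bare-bones sketch of what "enough process supervision" could mean: a restart loop running as pid 1. `APP_CMD` is a placeholder for your real entrypoint, and `MAX_RESTARTS` exists only so the sketch terminates — a real init would loop forever, and a real deployment would more likely use s6, supervisord, or similar:

```shell
#!/bin/sh
# Toy pid-1 supervisor: rerun the app whenever it exits, so pid 1
# (and therefore the VM) stays up instead of dying with the app.
APP_CMD="${APP_CMD:-true}"            # placeholder for the real entrypoint
MAX_RESTARTS="${MAX_RESTARTS:-3}"     # bounded only so this sketch terminates
n=0
while [ "$n" -lt "$MAX_RESTARTS" ]; do
  $APP_CMD
  echo "app exited with status $?; restarting" >&2
  n=$((n + 1))
done
```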
That all makes sense. I think context matters. Yes, when I have a service that's been up and running for some time, if that service fails, I want my service to restart.
But when I'm trying to get a service running for the first time and I'm not sure that I have the right command in the entrypoint, the right arguments to that command, or the right supporting files in place, or the right libraries installed, or the right file permissions, …, well, then I don't want things to just blindly restart, I want a handle and some information so I can figure out why it isn't working.
ETA: I recognize that your link to docs about running a supervisor addresses this problem. For me this raises some interesting questions. Like, I understand why Ben would implement `litestream exec` but maybe it would be better to steer users to a proper supervisor? Separately, what if it's the supervisor that's failing? Now I'm back to seeing kernel panics and not having error messages or a shell.
Usually, when I'm debugging a container, I start with a `tail -f /dev/null` entrypoint, or something like that, and then just shell into it to run the real entrypoint to see if it's working.
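Spelled out with plain Docker (image and container names are placeholders, and this assumes a local Docker daemon, so it's a transcript rather than a self-contained script):

```shell
# Swap the entrypoint for a no-op that keeps the container alive...
docker run -d --name entrypoint-debug --entrypoint tail myimage -f /dev/null
# ...then shell in and run the real entrypoint by hand to see what breaks.
docker exec -it entrypoint-debug sh
docker rm -f entrypoint-debug
```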
> We could conceivably add a flag for our `init` to hang around waiting for you to SSH in after your entrypoint exits. But that's clunky and complicated
This is a common thing in CI platforms, and the way they usually expose this is "run tests with SSH enabled", and they keep it open for 30 minutes/2 hours/whatever until a session closes.
So if I have some app failing, being able to run `fly restart --ssh-debug`, having that first just sit around waiting for the app to boot, and then dropping into ssh would be a very helpful piece of UX. The main thing is cleanup, but y'all charge for compute! You can be pretty loosey-goosey on that one honestly.
Maybe he means/wants some kind of “console” access for his “instance”? Seems unnecessary, as I'm sure it takes less than a second to boot, and if there's a problem booting up, it's probably not one that a user/customer would be able to solve at a console.
For me, Fly.io is like a seminal step toward an imaginary OS of the future: standardized, auto-clustered, just works, dead simple to grasp. Kind of the Windows 95 of its time.
Since the fly folks are here: do you plan to support monorepo apps in a single fly.toml? I.e., define all services in a single fly.toml, like docker compose or Render's blueprints?
Also, a way to reuse the definition but run a different command. For example, in a Rails app, the web server (puma) and the job worker (sidekiq) differ only in the command they run. On Fly, I would have to duplicate those. If there is an easier way, do let us know.
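For what it's worth — this is from memory rather than from this thread, so check Fly's current docs — I believe fly.toml gained process groups that cover exactly this puma/sidekiq split; a sketch (app name and commands hypothetical):

```toml
app = "myapp"

# One fly.toml, two process groups sharing the same image,
# differing only in the command they run.
[processes]
web    = "bundle exec puma -C config/puma.rb"
worker = "bundle exec sidekiq"

[[services]]
processes     = ["web"]   # only the web group receives routed HTTP traffic
internal_port = 8080
```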