Ask HN: Have you ever switched cloud?
274 points by dustinmoris on April 7, 2022 | 259 comments
Has anyone ever switched clouds from one service provider to another (e.g. AWS to Azure or vice versa), partially or entirely?

If so why? They all offer almost identical services. Do small (but maybe significant?) differences or unique products (e.g. Spanner) make such a big difference that it has swayed someone to switch their cloud infrastructure?

I wonder how much these little things matter, how such a transition (partial or complete) went, and how key stakeholders (who were possibly heavily invested in one cloud or felt responsible for the initial choice) were convinced to make the switch?

I'd love to hear some stories from real world experiences and crucially what it was that pushed the last domino to make the move.




Yes. I once did a zero-downtime migration for a client, first from AWS to Google, then from Google to Hetzner. Mostly for cost reasons: they had a lot of free credits, and moved to Hetzner when they ran out.

Their savings from using the credits were at least 20x what the migrations cost.

We did the migration by having reverse proxies in each environment that could proxy to backends in either place, setting up a VPN between them, and switching DNS. The trickiest part was the database failover and ensuring updates would be retried transparently after switching master.

The upside was that afterwards they had a setup that was provider-agnostic and ready to do transparent failover of every part of the service, all effectively paid for through the free credits they got.


Would you have a more detailed write-up of what you did, even at a high level? Seems like a cool thing to do.


Unfortunately not, but it's surprisingly straightforward apart from the database bit, so here's a bit more detail from memory. There are many ways of doing this, and some will depend strongly on which tools you're comfortable with (e.g. nginx vs. haproxy vs. some other reverse proxy is largely down to which one you know best and/or already have in the mix). [Today I might have considered K8s, but this was before that was even a realistic option, and frankly even with K8s I'm not sure; the setup in question was very simple to maintain.]

* Set up haproxy, nginx or similar as a reverse proxy and carefully decide if you can handle retries on failed queries. If you want true zero-downtime migration there's a challenge here in making sure you have a setup that lets you add and remove backends transparently. There are many ways of doing this of varying complexity; see the nginx sketch after this list for one of them. I've tended to favour using dynamic DNS updates for this; in this specific instance we used HashiCorp's Consul to keep DNS updated with services. I've also used ngx_mruby for instances where I needed more complex backend selection (it allows writing Ruby code that executes within nginx).

* Set up a VPN (or more, depending on your networking setup) between the locations so that the reverse proxy can reach backends in both/all locations, and so that the backends can reach databases in both places.

* Replicate the database to the new location.

* Ensure your app has a mechanism for determining which database to use as the master. Just as for the reverse proxy, we used Consul for the selection. All backends would switch on promotion of a replica to master.

* Ensure you have a fast method to promote a database replica to a master. You don't want to be in a situation of having to fiddle with this. We had fully automated scripts to do the failover.

* Ensure your app gracefully handles database failure of whatever it thinks the current master is. This is the trickiest bit in some cases, as you either need to make sure updates are idempotent, or you need to make sure updates during the switchover either reliably fail or reliably succeed. In the case I mentioned we were able to safely retry requests, but in many cases it'll be safer to just punt on true zero downtime migration assuming your setup can handle promotion of the new master fast enough (in our case the promotion of the new Postgres master took literally a couple of seconds, during which any failing updates would just translate to some page loads being slow as they retried, but if we hadn't been able to retry it'd have meant a few seconds downtime).
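
For illustration, the dynamic-backend piece can be as small as the nginx fragment below. This is only a sketch: the Consul service name, port and listen address are made up, and a real setup will have more moving parts.

    # Resolve backends via the local Consul agent's DNS interface, so hosts in
    # either environment can be added or removed without reloading nginx.
    resolver 127.0.0.1:8600 valid=10s;           # Consul agent answers DNS on 8600

    server {
        listen 80;
        location / {
            set $backend "app.service.consul";   # hypothetical Consul service name
            proxy_pass http://$backend:8080;     # a variable here forces re-resolution
        }
    }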

Once you have the new environment running and capable of handling requests (but using the database in the old environment):

* Reduce DNS record TTL.

* Ensure the new backends are added to the reverse proxy. You should start seeing requests flow through the new backends and can verify error rates aren't increasing. This should be quick to undo if you see errors.

* Update DNS to add the new environment reverse proxy. You should start seeing requests hit the new reverse proxy, and some of it should flow through the new backends. Wait to see if any issues.

* Promote the replica in the new location to master and verify everything still works. Ensure whatever replication you need from the new master works. You should now see all database requests hitting the new master.

* Drain connections from the old backends (remove them from the pool, but leave them running until they're not handling any requests). You should now have all traffic past the reverse proxy going via the new environment.

* Update DNS to remove the old environment reverse proxy. Wait for all traffic to stop hitting the old reverse proxy.

* When you're confident everything is fine, you can disable the old environment and bring DNS TTL back up.

The precise sequencing is very much a question of preference - the point is you're just switching over and testing change by change, and through most of them you can go a step back without too much trouble. I tend to prefer doing the changes that are low-effort to reverse first. Keep in mind that some changes (like DNS) can take some time to propagate.
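
A quick way to sanity-check what resolvers are actually handing out during the switchover (the hostname and resolver below are placeholders):

    dig +noall +answer app.example.com            # records plus the remaining TTL
    dig +noall +answer app.example.com @1.1.1.1   # ask a specific public resolver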

EDIT: You'll note most of this is basically to treat both sites as one large environment using a VPN to tie them together and ensure you have proper high availability. Once you do, the rest of the migration is basically just failing over.


People get paid hard cash for lower quality plans than you’ve just provided, thanks a lot! :)


> If you want true zero-downtime migration there's a challenge

It is astounding how many people require 24/7 ops... while working 8/5.

Otherwise this comment is an exemplar of how things should be done. My take on this is that OP is a sysadmin, not a dev. *smug smile*


> It is astounding how many people require 24/7 ops... while working 8/5.

In this case the client had an actually global audience. They could have afforded downtime for the actual transition, but it was a useful test of the high-availability features that mattered to them.

I do agree with the overall principle, though - a whole lot of people think they need 24/7 and can't afford downtime, yet almost all of them are a lot less important than e.g. my bank, which does not hesitate to shut down its online banking for maintenance now and again. As it turns out, most people can afford downtime as long as it's planned and announced. Convincing management of that is a whole other issue.

> My take on this is what OP is a sysadmin, not a dev. smug smile

Hah. I'd say I was devops before devops was a thing. I started out writing code, but my first startup was an ISP where I was thrown head-first into learning networks (we couldn't afford to pay to have our upstream provider help set up our connection, so I learnt to configure cisco routers while having our provider on the phone and feigning troubleshooting with a lot of "so what do you have on your side?") and sysadmin stuff, and I've oscillated back and forth between operations and development ever since. Way too few developers have experienced the sysadmin side, and it's costing a lot of companies a lot of money to have devs that are increasingly oblivious to hardware and networks.


> It is astounding how many people require 24/7 ops

Yet when us-east-1 goes offline, it's mostly just shrug and wait for it to come back, because it's not our fault...


Really well done keeping this simple!

It's also another one of those situations where good design principles and coding practices pay off. If the app is a tangled mess of interconnected services, scripts, and cron jobs this kind of transition won't be possible.


Damn, this is why I come to HN. This is awesome, thank you so much for taking the time to write it up.


This was really nice to read. Thanks!


Bump! This sounds very interesting.


Highly recommend WireGuard for this (see Kilo for a k8s-specific option that works with whatever network you have set up). Setting up a VPN that just works is super simple.
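
For anyone who hasn't used it: a minimal point-to-point tunnel is roughly the config below (keys, IPs and subnets are made up), brought up with "wg-quick up wg0" on each side.

    # Hypothetical /etc/wireguard/wg0.conf on one side of the tunnel
    [Interface]
    Address = 10.8.0.1/24
    PrivateKey = <this side's private key>
    ListenPort = 51820

    [Peer]
    PublicKey = <other side's public key>
    Endpoint = 203.0.113.50:51820            # the other environment's gateway
    AllowedIPs = 10.8.0.2/32, 10.9.0.0/24    # peer tunnel IP plus its internal subnet
    PersistentKeepalive = 25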


yep, wireguard is the secret for intercloud, for sure.


Same. Bump!


We haven't done a cloud migration, but I know from zero-downtime DATABASE migrations that if you do this:

- don't use any cloud service that isn't a packaged version of an installable/usable OSS project

- architect your services to be able to double-write and switchover the read source with A/B deployments

If you can migrate your database without downtime this way, then you are much more flexible than if not.


> architect your services to be able to double-write

Can you share any details on how to achieve this?

For instance, if the first database accepts the write but the second is temporarily inaccessible or throws an error, do you roll back the transaction on the first and throw an error, or <insert_clever_thing_here> ... ?


You're right, double-writes are flexible, and it's great when it works. With schema migrations I'm fine with it, because you can usually enforce consistency.

But with migrations from one database to another at different locations, I'm lukewarm to it because it means having to handle split-brain scenarios, and often that ends up resulting in a level of complexity that costs far more than it's worth. Of course your mileage may vary - there are situations where it's easy enough and/or where the value in doing it is high enough.
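
For what it's worth, the naive version of the pattern looks something like the sketch below (all names are hypothetical); the hard part is deciding what happens when the second write fails, which is exactly the consistency/split-brain question above.

    # Naive double-write sketch: the current primary stays the source of truth,
    # the migration target gets a best-effort copy, and failed copies are
    # journalled for later replay/reconciliation instead of failing the request.
    def dual_write(primary, target, journal, statement, params):
        primary.execute(statement, params)      # source of truth; let errors propagate
        try:
            target.execute(statement, params)   # best-effort write to the new database
        except Exception as exc:
            journal.append((statement, params, str(exc)))  # replay/reconcile later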


At my previous job we ran a bare metal cluster at Hetzner, and monitoring the hardware was quite an intensive task: always monitoring hard drives, network bottlenecks, CPU usage, etc. This was before K8s, so it might not be comparable to today.

Would you say bare metal cost a lot of extra monitoring/maintenance, or is this something you did on the cloud hardware as well anyway? Do you run virtualization on the Hetzner machines?


I would say cloud costs a lot of extra maintenance. When I did contracting, those of my clients who insisted on AWS tended to be far more profitable clients for me, because they needed help with so much more stuff.

In terms of monitoring, it boils down to picking a solution and building the appropriate monitoring agent into your deployment.

I've run basically anything I run in some virtualized env. or other since ~2006 at least, be it OpenVZ (ages ago), KVM, or Docker. And that goes for Hetzner too. It makes it easy to ensure that the app environment is identical no matter where you move things. I managed one environment where we had stuff running on-prem, in several colos, on dedicated servers at Hetzner, and in VMs, and even on the VMs we still containerised everything - deployment of new containers was identical no matter where you deployed. All of the environment-specific details were hidden in the initial host setup.


Just a note, there are some companies that have been popping up recently to try to bridge the gap for services on clouds like Hetzner.

https://elest.io

https://nimbusws.com (I'm building this one so I'm biased for it).

> Would you say bare metal cost a lot of extra monitoring/maintenance, or is this something you did on the cloud hardware as well anyway? Do you run virtualization on the Hetzner machines?

It cost a lot of monitoring/maintenance up front, but once things are purring the costs amortize really well. Hetzner has the occasional hard drive failure[0], but you should be running in a RAIDed setup (that's the default for Hetzner-installed OSes), so you do have some time. They also replace drives very quickly.

If you really want to remove this headache, you run something like Ceph and make sure data copies are properly replicated to multiple hosts, and you'll be fine if two drives on a single host die at the same time. Obviously nothing is ever that easy, but I know that I spend pretty much no time thinking about it these days.

I run a Kubernetes cluster and most of my downtime/outages have been self-inflicted, I'm wonderfully happy with my cluster now. Also another thing to note is that control plane downtime != workload downtime, which is another nice thing -- and you can hook up grafana etc to monitor it all.

[0]: https://vadosware.io/post/handling-your-first-dead-hetzner-h...


My estimate: the company had a combined total of $200,000 in credits across both clouds, and the OP charged $10,000 for the migrations.


You're reasonably in the ballpark. It was complicated a bit because hosting on Hetzner was so much cheaper for them that $1 in credit was not worth $1 to them, as if/when they had to spend cash they spent substantially less than that at Hetzner.


Thanks. Yeah, I totally understand that dynamic. We have now run out of credits on AWS and are shifting some of our workloads into a datacenter ourselves. :)


Several years ago I had a string of clients who also came into enough Google Cloud credits to make a switch worthwhile.

For these companies it wasn't a problem to have a few minutes of downtime, so the task was simply recreating their (usually AWS) production environment in Google Cloud.


If it had been a one-off migration we might have done the same, but the end goal was Hetzner from the start, which meant their architecture needed to handle the HA piece anyway; doing it this way also served as nice validation that we really could fail over without anything going down.


I hate to be the bean counter, but what was the true cost ultimately, counting you and your team as resources?

It's nice that you ended up with a provider agnostic capability to deploy anywhere, but none of that was free in terms of ownership costs to get there.


I was the only person doing the migration work and setting up the HA setup and documenting it for them to take over. My fee for setting it up and doing it accounted for that 20x difference. Hetzner in the end was far cheaper for them to run including the devops work they contracted to me. I effectively got paid to reduce my future earnings from them. But that was fine; when I was doing contracting, that was a big part of my pitch - that I'd help drive down their costs and pay for myself many times over - if people were in doubt I'd offer to take payment as a percent of what they saved.

So, no, it wasn't free, but it saved them far more money than it cost them, both the initial transition and in ongoing total cost of operation.

In fact, my first project for them was to do extensive cost-modelling of their development and operations.


Ah, now I understand. Sorry, it hadn't joined up in my brain that you were providing this as a service as a third party. Thanks for the explanation.


See this older sibling comment https://news.ycombinator.com/item?id=30950842


At GitLab we went from AWS to Azure, then to Google Cloud (this was a few years ago). AWS was what we started with, and I think like most companies very little attention was paid to the costs, setup, etc. The result was that we were basically setting money on fire.

I think at some point Azure announced $X in free credits for YC members, and GitLab determined this would save us something like a year's worth in bills (quite a bit of money at the time). Moving over was rather painful, and I think in the months that we used Azure literally nobody was happy with it. In addition, I recall we burned through the free credits _very_ fast.

I don't recall the exact reasoning for switching to GCP, but I do recall it being quite a challenging process that took quite some time. Our experiences with GCP were much better, but I wouldn't call it perfect. In particular, GCP had/has a tendency to just randomly terminate VMs for no clear reason whatsoever. Sometimes they would terminate cleanly, other times they would end up in a sort of purgatory/in between state, resulting in other systems still trying to connect to them but eventually timing out, instead of just erroring right away. IIRC over time we got better at handling it, but it felt very Google-like to me to just go "Something broke, but we'll never tell you why".

Looking back, if I were to start a company I'd probably stick with something like Hetzner or another affordable bare metal provider. Cloud services are great _if_ you use their services to the fullest extent possible, but I suspect for 90% of the cases it just ends up being a huge cost factor, without the benefits making it worth it.


I love Hetzner, and host most of my own stuff on it. And it takes so little to be prepared to move. E.g. a basic service discovery mechanism, a reverse proxy and putting things in containers and you can migrate anywhere. Now that Hetzner has some cloud features too I see even less reason to go elsewhere (though in my current job we use AWS, but we use AWS with the explicit understanding that we're low volume - currently mostly running internal tools - and can afford the premium; if we needed to scale I'd push for putting our base load somewhere cheaper, like Hetzner)

One additional suggestion to people considering bare metal: consider baking in a VPN setup from the start, and pick a service discovery mechanism (such as Consul) that is reasonably easy to operate across data centres. Now you have what you need to do a migration if you need to, but you also have the backbone to turn your setup into a hybrid setup that can extend into whichever cloud provider you want too.

A reason for wanting that is that one of the best ways I've found of cutting the cost of using bare metal even further is to have the ability to handle bursts by spinning up cloud instances in a pinch. It allows you to safely increase the utilisation levels of your bare metal setup substantially, with corresponding cost savings, even if you in practice rarely end up needing the burst capability. It doesn't even need to be fully automated, as long as your infra setup is flexible enough to accommodate it reasonably rapidly. E.g. just having an AMI ready to go with whatever you need to have it connect to a VPN endpoint and hook into your service discovery/orchestration on startup can be enough.
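
To make that concrete: the user-data for such a burst instance can be tiny if the image is prepared ahead of time. A sketch, assuming the AMI already ships a WireGuard config at /etc/wireguard/wg0.conf and a Consul agent with its service definitions (unit names will vary with your packaging):

    #!/bin/bash
    set -euo pipefail

    # Join the private network that spans the bare-metal environment.
    systemctl enable --now wg-quick@wg0

    # Start the Consul agent so the instance registers itself and the reverse
    # proxy / orchestration can start sending work its way.
    systemctl enable --now consul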


Azure to GCP was right after GV became an investor. Again, credits were the reason to change over, not the poor Azure performance.


One of the things I'm immensely curious about is how you handle security/networking/firewalls when working with Hetzner or other bare metal providers? It seems they don't provide network gear for firewalls and to protect against DDOS attacks?

Do you just use iptables? Or do you build out more complex solutions like software routers running on Linux/BSD?

I work in online gaming, and we're constantly seeing attacks on our infra.


Hey! I also make games and we solved the issue by doing stateless ACLs on the border switches and having a really fat pipe.

You can add a magic header to traffic and drop anything that doesn’t contain the header.

Since this is done in hardware it operates at line speed. In our case 100GBit/s.


This seems like such a good trick. We could even do it at the Cloudflare rule level, I guess.


If you have zero-cost internal networking, I'd consider adding another server in front of the primary servers to act as a reverse proxy and/or firewall. Basically, you'd use that server as a firewall and then pass only the good traffic onwards to your game servers, which are probably bigger and more expensive.

If free internal networking isn't a possibility, then I'd probably use the included iptables as a firewall on each machine. You should honestly have this running on the game servers anyway, if only to restrict communication to between the reverse proxy and the game server.
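
As a rough sketch (addresses and ports below are made up), the per-machine rules really can be that small:

    # On a game server: only the internal reverse proxy may reach the game port,
    # SSH only from a bastion host, everything else dropped.
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp -s 10.0.0.2 --dport 7777 -j ACCEPT    # proxy -> game port
    iptables -A INPUT -p tcp -s 203.0.113.10 --dport 22 -j ACCEPT  # bastion -> SSH
    iptables -P INPUT DROP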


Hetzner has some built-in DDoS protection, but you should add your own:

https://www.hetzner.com/unternehmen/ddos-schutz


I wish Hetzner provided RDS


So it's a bit in the future, but managed Postgres is something I want to offer through nimbus[0]. In the meantime, if you're interested in other managed services, please sign up!

[0]: https://nimbusws.com


Wow, never heard of Nimbus but the kickbacks for open source projects are an awesome idea - fair play.


Thanks! It's not out yet :) I've been working on it for a bit and I really think this is what's missing from the hyperscaler/cloud model.

If we just put a LITTLE bit back into (like 5% of revenue for any of the big companies honestly -- but I'm not big yet so more for me) the F/OSS ecosystem, can you imagine what kind of world we'd be in??

I want to live in that world, so I'm trying to make it happen.


I think your website is broken on mobile. The alignment is off and the contact form is cut off.


oh thanks -- I think I've got some image-jumping-border issues, looking into it right now!


It's still a bit early for the company but if you like RDS, PlanetScale might be worth a look too: https://planetscale.com/


I switched from Azure to DigitalOcean to Hetzner. Reasons were as you stated, simpler cost model, simpler technology.


Does Hetzner have servers in the US? If not, is your app OK with being hosted in the EU?


I was also wondering this so I just looked it up -- looks like they recently (well, Nov. 2021) expanded to Virginia.


My business is based in the UK which is compatible with EU’s GDPR, so it’s fine to be hosted there.


> I think at some point Azure announced $X in free credits for YC members, and GitLab determined this would save us something like a year's worth in bills (quite a bit of money at the time). Moving over was rather painful, and I think in the months that we used Azure literally nobody was happy with it. In addition, I recall we burned through the free credits _very_ fast.

That sounds like the worst reason to change cloud providers: "because that provider bribed me to"


Moved from Google Cloud -> Digital Ocean -> OVH.

Running our own stuff on high-powered servers is very easy and less trouble than you think. Sorting out the deploy with a "git push" and build container(s) meant we could just "set it and forget it".

We have a bit under a terabyte of PostgreSQL data. Any cloud is prohibitively expensive.

I think some people think that the cloud is as good as sliced bread. It does not really save any developer time.

But it's slower and more expensive than your own server by a huge margin. And I can always do my own stuff on my own iron. Really, I can't see a compelling reason to be in the cloud for the majority of mid-level workloads like ours.


> Really, I can't see a compelling reason to be in the cloud for the majority of mid-level workloads like ours.

I work on a very small team. We have a few developers who double as ops. None of us are or want to be sysadmins.

For our case, Amazon's ECS is a massive time and money saver. I spent a week or two a few years ago getting all of our services containerized. Since then we have not had a single full production outage (they were distressingly common before), and our sysadmin work has consisted exclusively of changing versions in Dockerfiles from time to time.

Yes, most of the problems we had before could have been solved by a competent sysadmin, but that's precisely the point—hiring a good sysadmin is way more expensive for us than paying a bit extra to Amazon and just telling them "please run these containers with this config."
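
And that config really is small. A rough sketch of a Fargate task definition (names, sizes and the image are placeholders, and a real one also needs an execution role for pulling images and shipping logs):

    {
      "family": "web-app",
      "requiresCompatibilities": ["FARGATE"],
      "networkMode": "awsvpc",
      "cpu": "256",
      "memory": "512",
      "containerDefinitions": [
        {
          "name": "web",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:1.4.2",
          "portMappings": [{ "containerPort": 8080 }],
          "essential": true
        }
      ]
    }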


> None of us are or want to be sysadmins.

It's such a huge misconception that by using a cloud provider you can avoid having "sysadmins" or don't need those skills. You still need them, no matter which cloud and which service you use.


Which skills specifically do you think we might be missing that we would need to run an app on a managed container service and managed database?

I know how to configure firewalls, set up a (managed) load balancer, manage DNS, and similar tasks directly related to getting traffic to my app.

What I no longer have to know how to do: keep track of drive space, manage database backups, install security updates on a production server without downtime, rotate SSH keys, and a whole bunch of other tasks adjacent to the app but not actually visible to incoming traffic at all.


You still need to do backups; a database backup is just one part of that. If you are not following the 3-2-1 rule and don't test your restore mechanism, you don't have reliable backups.

Those things you listed are still sysadmin tasks in my eyes, and you are doing them, validating my point.

You still have to track storage space, either because you are paying for it and need to expand when necessary, or because you have to manage costs at some point; that's not completely out of the picture. It can be easier, for sure, than building your own storage hardware.

You still need to keep systems up to date: either you are using Docker, so you are doing it at the "application level", or you are using Linux VMs and you need to upgrade those systems/images. Even if you are using something like Functions or Lambda, those have their own environment which you need to be aware of, and they usually support specific versions of programming languages, so you need to upgrade your own stack when they don't support older versions anymore.


I tell you that ECS has eliminated a ton of extra work for my team for a bargain price, and your response is "but you still have to do x, y, and z!" It's like saying that I shouldn't buy a dishwasher because I'd still have to wash the pots.

Yes, we still need to do some sysadmin-y tasks. But ECS handles so many of them that we actually have the time, energy, and knowledge to take care of the few that remain.

(As an aside, keeping language and OS versions up to date becomes a development task rather than an ops task when running Docker + ECS. We increment a version number in the repository and test everything, the same as we do for any library or framework that we depend on.)
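
Concretely, the thing being bumped is often nothing more than a pinned base image tag; a placeholder Dockerfile sketch:

    # Bumping the pinned tag below and re-running the test suite is essentially
    # the whole OS/language upgrade workflow.
    FROM python:3.12-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    CMD ["python", "app.py"]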


> As an aside, keeping language and OS versions up to date becomes a development task rather than an ops task when running Docker + ECS.

It's a development task with a proper bare metal setup too.


If you use purely PaaS offerings (or FaaS as well), then you also don't really need sysadmins.

That's not to say that you can get away with knowing _no_ sysadmin skills in these scenarios, but you don't need to have someone on staff who knows the ins and outs of Cassandra or Mongo or whatever you're using. In awful workplaces with high turnover, it's worth it for management to opt for these managed services so that when the overworked tech lead decides to rightfully bail on them, she/he doesn't leave them in the lurch. (Note: I'm not defending these workplaces, but just explaining that when they can't keep adequate in-house talent to manage their own services, it makes financial sense to outsource it, and pay the "cloud tax").


I think the problem with cloud environments is you do not "need" sysadmins - it is not obvious you need them, so what you end up with is a bunch of systems glued together without much thought, and then crazy things like HTTP logs not being turned on for your various services, insane service costs b/c of not understanding pricing tiers, etc..


The difference in ops between setting up a couple of Lambdas or Fargate containers and provisioning your own servers is substantial.


In fact, if you're using Linux on your workstation you'll use the same skills locally as you do on the VPS/bare metal (depending on your scale.) Arguably "cloud" services need more sysadmin skills, not less.


That's a very big if.

I have yet to work with a $corp that uses Linux for workstations.

Overwhelming majority uses Windows. Some use macOS.

The occasional developer that uses Linux will usually be in a VM or, if IT policies allow, WSL.

So yeah, running cloud services doesn't require sysadmin skills, unless you count copy-pasting from official documentation as "sysadmin skills".


That's funny... every team I've been on in the last 10 years has used Linux workstations almost exclusively, with a few Macs here and there.


In 27 years, I've had exactly two jobs where I didn't have Linux on my desktop, for a total of 5 out of those 27 years. In both cases, I still did all of my dev work on Linux.

It boils down to what kind of jobs you look for.

> So yeah, running cloud services doesn't require sysadmin skills, unless you count copy-pasting from official documentation as "sysadmin skills".

If that's the extent of how you're managing your cloud setup, then I could equally argue running bare metal servers doesn't require sysadmin skills either. When I did contracting, a large part of my income was to come in and clean up after people had relied on "copy pasting from official documentation" as a substitute for actual ops.


It's far easier to maintain my own Linux workstation than an internet-facing server used daily by customers.


Absolutely but most of the knowledge translates, it's the procedures that differ.


It's those different procedures that I'm trying to avoid. It's not that I couldn't do those things or learn to do them, it's that my time is best spent building and improving our applications, not keeping servers running, secured, and up to date.

At some point we hope to get to the scale where it makes sense to pay a human to do that, but at this point the additional cost incurred by an ECS instance over an equivalent server is negligible.


Very similar experience here. I work on a two person "DevOps" team. Without AWS ECS we would have to have a much higher headcount. I get to spend most of my time solving real problems for the engineers on the product team rather than sysadmin work.


What are “real problems” for the engineers or product team?


Things like automating manual workflows, building small infrastructure debugging tools, or providing infrastructure consultation to an engineer trying to decouple two parts of a legacy code base.


Managed container services (like Amazon ECS) are a sweet spot for me across complexity, cost, and performance. Mid size companies gain much of the modern development workflow (reproducibility, infrastructure as cattle, managed cloud, etc.) using one of these services without the need to go full blown Kubernetes.

It's lower level than functions as a service, but much cheaper, more performant, matches local developer setups more closely (allowing for local development vs. trying to debug AWS Lambda or Cloudflare FaaS using an approximation of how the platform would work).


Very much agree - due to a coworker leaving recently, I'm looking after two systems. They're both running on ECS and using Aurora Serverless.

My company takes security very seriously so if these two systems were running on bare-metal I'd probably be spending one day a week patching servers rather than trying to implement new functionality across two products.


I can bet our team is smaller than yours.

And yet… Sysadmin tasks take up maybe 2 hours a month.

Your theory is right though if no one on your team knows how to set up servers. In your case the cloud makes sense.


To the peeps running ECS? Why not just straight up AKS or GKE? Have you compared ECS to Cloud Run on GCP?


In my case, mostly because it was easier to get buy-in from the rest of the team on ECS than Kubernetes.


"Infrastructure is cheaper than developers(sysadmins)" all over again.


I also found that running a PostgreSQL database is really simple. Especially if most of your workload is read only, a few dedicated servers at several providers with a PostgreSQL cluster can deliver 100% uptime and more than enough performance for pretty much any use case. At the same time, this will still be cheaper than one managed database at any cloud provider.

I've been running a PostgreSQL cluster with significant usage for a few years now, never had more than a few seconds downtime and I spend next to no time maintaining the database servers (apart from patching). If most requests are read only, clusters are so easy to do in Postgres. And even if one of the providers banned my account, I'd just promote a server at another provider to master and could still continue without downtime.

I recently calculated what a switch to a cloud provider would cost, and it was at least 10x of what I pay now, for less performance and with lock-in effects.

But I understand that there are business cases and industries where outsourcing makes sense.


Can you share more details? Because I'm in the process of doing the same, since having a few terabytes of PostgreSQL / DynamoDB is stupidly expensive.

For a lot of big organizations it's a matter of accountability. If they can say "AWS went down" vs. "our dedicated servers went down", it matters a lot for insurance and clients.

What I don't get are 4-man startups paying thousands to AWS... because everybody does it.


As I said, if most queries are read-only it's really simple. Streaming replication works very well out of the box, just make sure you keep enough WAL segments on master so that slaves can catch up after some downtime.

I have a 1-1 relationship between application servers and databases. The application queries replication delay and marks itself as unhealthy and reports an error if the delay is too high. You can also do that via postgres (max_replication_delay), but I found this way to allow for more graceful failovers.

With streaming replication, servers are completely identical, so you can easily provision a new server. Failover is done by just one command on a slave. I don't have automatic failover as I only needed to use that once in several years (and that was on purpose), I'd rather accept downtime than having an unwanted failover.
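
For reference, both pieces can be one-liners in stock Postgres (the data directory path below is a placeholder):

    -- On a replica: how far behind replay is (what the health check can look at)
    SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

    -- Promote the replica to primary (PostgreSQL 12+) ...
    SELECT pg_promote();
    -- ... or from the shell on the replica host:
    --   pg_ctl promote -D /var/lib/postgresql/data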

With that setup you can always failover and can scale read operations really well. There are solutions for postgres if you need more complicated setups, but I never looked into them.

If you're in Europe, it's really cheap to get a dedicated machine from Hetzner with a few TB of NVMe. Just pay the extra money for 10gbit link, otherwise replication will take forever. But there are also some decent providers in the US, it's just more expensive. But with Hetzner, a two machine setup will be <$500 per month for really beefy servers.

I'd just be careful with using block storage, I often found that to be a bottleneck with database servers. Local storage is almost always much faster.

But in the end it depends on your use case. In the end, your database will usually go down because of a bug in the application or some misconfiguration. Both can happen on any service. It's really so rare these days to lose a server without notice. And Postgres is really stable, I've never seen it crash.


Maintainability is much easier on a well-working cloud setup for people who potentially have less knowledge.

One company had 6 servers and used AWS snapshot for backup + managed MySQL.

Backup and recovery of that DB is possible for more people on the team than if it ran as a non-managed service.


In my company, we were aware of the potential honeypots in each cloud and we developed our product from the first commit to be deployed on 3 (!) clouds: AWS, Azure, IBM.

And while we made it work by sticking to the least common denominator, which was FaaS/IaaS (Lambda, S3, API GW, K8s), it was certainly not easy. We also ignored tools that could've helped us greatly against only a single cloud in order to stay multi-cloud.

The conclusion after 2 years for us is kind of not that exciting.

[1] AWS is the most mature one, Azure is best suited for Microsoft products and Old Enterprise features. And IBM is best if you use only K8s.

[2] Each cloud has a lot of unique closed-code features that are amazing for certain use cases (such as Athena for S3 in AWS or Cloud Run in GCP). But relying on them means you are trapped in that cloud. Looking back, Athena could have simplified our specific solution if we were only on AWS.

[3] Moving between clouds, given shared features, is possible, but is definitely not a couple of clicks or a couple of Jenkins jobs away. Moving between clouds is a full-time job. Finding how to do that little VM thing you did in AWS, now in Azure, will take time and learning. And moving between AWS IAM and Azure AD permissions? Time, time and more time.

[4] Could we have known which cloud was best for us beforehand? No. Only after developing our product did we know exactly which cloud would offer us the most suited set of features. Not to mention the different credits and discounts we got as a startup.

Hope this helps.


Why did you feel IBM is best when you only use K8s? Our platform is fully K8s and I'm looking for somewhere we get more performance per buck, probably not IBM, but I'm surprised it's even in the list. Do they have an extra nice K8s dashboard or something?


Not sure where you're currently at and considering moving to, but GKE has been much simpler than EKS. Not sure about cost but it'll likely save some operations time (auto scaling is a single check box, no IAM, scaling controllers, etc)


IME GKE on GCP is top-tier if you want a painless managed k8s offering. I've been running my business on it (solo founder) since ~2017 with minimal fuss. Hosting a couple dozen WP sites and some bespoke webapps/apis.


Second this question. Never heard of IBM being exceptional for k8s. Curious to know what makes OP say it’s so good.


Hi, I was in an IBM Startup Accelerator. I just got the feeling that they were pushing for it hardcore. Gave us startup credit, free training and free premium support for K8s. So If you want to go with K8s and you are a Startup, that is the best offering I experienced.


Maybe they're using OpenShift (OKD). OpenShift has a bunch of value-add features: they add a template marketplace and turn it into a hybrid PaaS with some other multitenancy management bits, AFAIK.


OKD is slightly different than the OpenShift (OCP) provided on IBM Cloud. OpenShift is on all clouds but IIRC OpenShift on IBM Cloud receives OCP updates first.


I'd throw in stability, IAM, storage and the management plane.


Cloud Run works on all three major clouds, and VMWare, and Bare Metal. No lock-in here.


Do you mean GCP Cloud Run [1] ? I would love to have it on AWS and Azure if you have a link to share. Or do you mean it's possible, but through different services on each cloud?

[1] https://cloud.google.com/run



I assume the OP is referring to Knative[0] which is the framework powering Cloud Run behind the scenes.

[0] https://knative.dev/


No, Knative Serving, via Anthos fleets.


Athena isn't fully closed source. It's a customized, hosted version of Presto originally built by Facebook.

Apache Drill is in a similar space to Athena and can query unstructured or semi-structured data in object storage/S3.


You are correct, but as I understand it, they get "raw" file access to the S3 hard disk (or the equivalent) in their solution. So no matter what solution I might spawn, it will always be slower and more expensive.

Maybe I should have said "closed internal access features"


Doubtful. I've personally had projects pull many hundreds of gigabits per second of s3 throughput. How you architect and design has a large influence to your analytics performance.


“Many hundreds”

Last time I talked to a technical person at AWS the limit was 5GBits. Wonder what you’re doing differently.

Perhaps that changed.


There's another benchmark somewhere showing S3 can max out a 100Gbps instance.

https://github.com/dvassallo/s3-benchmark

Another potential issue is ListBucket rate limiting. If you have lots of small objects, you'll spend more time waiting to discover the names than transferring data.


you were quoted a per-object rate.


Was it a business or technical decision to do multi-cloud?

Did you run simultaneously in 3 clouds? Can you explain the setup?

If not, did you just run on each for a while to test, or have a reason to switch?

This is probably an impossible question to answer, but: were the savings/benefits of doing this actually worth the engineering costs involved in the end? Eg even if you chose what turned out to be the most expensive, worst option, would the company ultimately have been in a better place by having engineering focused on building things to increase customer value instead?


> Was it a business or technical decision to do multi-cloud?

> Did you run simultaneously in 3 clouds? Can you explain the setup?

The solution itself could be running on a single cloud. But we work in the finance sector and targeting highly regulated clients. And we got a tip very early on, that each client could ask for deployment on their cloud account that is monitored by them. Which will probably be AWS or Azure. Today we know only some require that. So it helped somewhat.

> were the savings/benefits of doing this actually worth the engineering costs involved in the end?

Like you said, very hard to know. In our case, we had a DevOps/cloud guy working it as a full-time job, so it was not noticeable. The reason being, probably:

[1] Because although he had problems to solve on all clouds, cloud deployments eventually get stable enough, so the pressure was spread.

[2] Although all clouds still need constant maintenance, it's asynchronous (you can't plan ahead for when AWS EKS will force a new K8s version), so the pressure was spread out and it never stopped client feature building.

But who knows, maybe for other architectures or a bigger company, it would have become noticeable.


Also cold start times for serverless differ greatly between those cloud providers. AWS is < 1 second, whereas Google Cloud is 5-10 seconds


This is pretty misinformed. Each provider has multiple “serverless” offerings and cold start time has much to do with your specific application and what it’s doing on start up.


None of my workloads (5x Cloud Run services, 10's of Functions) have anywhere near 5s cold starts. More like 2s with network latency.


Thanks for acknowledging how much harder this is when you use a cloud-specific feature. Modifying your codebase to migrate off some cloud specific service seems like it would be by far the hardest part of switching clouds.


Thank you for this very useful answer!


Yeah, but it was only for a side-project, so I only had a single VM to migrate.

I went from AWS (cost ~£25/mo) to Microsoft Azure DE because I didn't want any user data to be subject to search by US law enforcement/national security. I thought the bill would be about the same, but it more than quadrupled almost overnight even though traffic levels, etc., were the same (i.e., practically non-existent).

What was happening was Azure was absolutely shafting me on bandwidth, even with Cloudflare in the mix (CF really doesn't help unless you're getting a decent amount of traffic).

In the end I moved to a dedicated server with OVH that was far more powerful and doesn't have the concerns around bandwidth charges (unless I start using a truly silly amount of bandwidth, of course).


+1 on dedicated server. It's so much simpler & more cost-effective than the cloud.

10 big dedicated servers can probably handle the same load as 100s to 1000s of cloud nodes for a fraction of the cost. Configuration and general complexity might even be simpler without the cloud.

It's not as hard as people make it out to be to set up backup and redundancy/failover.


Absolutely. And dedicated servers at many providers (e.g. Hetzner, OVH) can be provisioned much like vms. So the only real difference is that usually there's a minimum contract of 1 month, but often at a price where it's cheaper than running a VM at a cloud provider for 30% of the time.


There are two kinds of cloud users - those who treat their cloud as a VM to use, and those who actually use all the fancy API features.

The first group are almost always better served by dedicated VMs or hardware from a provider specializing in them, if the VM is long-lived.


I'm still not a fan of the second way. If you develop all your software to tightly integrate with AWS, you might save time developing the software but create a huge amount of technical debt.

Managing your own infrastructure (with dedicated servers, so no hardware management) isn't too hard, even if you're a small shop. And managing a fleet of AWS services isn't necessarily less work.

Maybe there's a reason all ads for cloud tend to compare it to running your own data center. Because once you get rid of hardware management, it's not really that much easier being in the cloud, at the risk of lock-in and huge surprise bills.


It’s not technical debt if it’s making you money. I would much rather solve the business problems than managing services myself.


It's technical debt if you may need to change your product because a third party makes changes. The beauty of software with few dependencies is that you can run decades-old software on a system just fine with no need to regularly refactor.

I know how tedious it is to maintain decades old enterprise Java software, but from a cost perspective, it makes much more sense to keep those rather than constantly refactor to chase the newest trend.

As an example, if you had a software that was written 20 years ago to store data in a relational DB, updating it to work with current versions of that database system won't be much work (if any). If you rely on managed services, I wouldn't be too sure that you get away that easy.


> As an example, if you had a software that was written 20 years ago to store data in a relational DB, updating it to work with current versions of that database system won't be much work (if any). If you rely on managed services, I wouldn't be too sure that you get away that easy.

This very much depends. In the example you gave, nothing changes if you used managed services or not.

But you could argue that the 20-year-old software is technical debt preventing the upgrade of the database, due to the source code being lost, or the library used to connect to the latest version of the database no longer existing, requiring a rewrite in a modern language or framework. Etc.

Technical debt really is about code that cannot easily be modified to adapt to the requirements of a business.

If you wrote some code, and it was trash, made no sense, in an obscure language that few people know, with no comments. Yet it ran for 10 years flawlessly with everyone too scared to look at it, but made the business money. It's not technical debt until it needs to be changed/modified.


> Managing your own infrastructure (with dedicated servers, so no hardware management) isn't too hard.

If you have infra skills then absolutely, it's way simpler to manage. But infra people like us don't really fit in "small shops" because the price tag for one of us is (depending on the cost of living) anywhere between a quarter and half a mil total comp. And if you ever want them to take a vacation not on-call, you'll need at least three. I say this against my own interest: just go with the managed services and consider it "good problems to have" when you feel the itch to hire some infra person to clean up the debt.


Maybe in SV the comp is that high. I only have insights into European markets, but here the comp doesn't differ too much from programmers. And an AWS expert won't be cheaper than someone with knowledge how to manage infrastructure.

And yes, you'll need three people, but not full time. From experience I can say that even with a hundred servers, it tends to be a small chunk of each of their time. And if you have a redundant system and don't deploy on Fridays, the chance that someone has to respond to a call on the weekend is pretty much 0.


That's something I find so fascinating. The "cloud" will almost always be more expensive and "not worth it" if you are only using the IaaS services. I mean, look at the numbers, everyone sees that.

Cloud only ever is worth it when one uses the higher-tier services, like AWS Lambda and the likes. Even running Kubernetes in the cloud is only semi worth it, because it's not high enough in the stack and still touches too many low-level IaaS services.

Of course, higher tier means more vendor lock-in and more knowledge required and all that. But if you are not willing to pay that price, then OVH, Hetzner and the likes will have better offerings for you.


The problem, as many have already pointed out around this thread, is that in an enterprise env. you can't really do that too much anymore. And as a result that starts being felt by non-enterprise shops too.

And you can't really do that because people don't really wanna deal with on-prem shit and server hosting.

Technically speaking, I am RHCSA certified; I know how to do all of this on-prem, hybrid stuff. I don't even bother looking at job offers from companies that aren't cloud-based (even if I would get a 10-15% increase, or more if coming from the financial sector), because I genuinely can't be arsed to deal with all that bullshit again.

I'm done with caring about disk space, and hw firewalls, and configuring bs in Linux. Fuck iptables, let me manage everything from a (network) security group. Fuck Traefik and F5 and all this bs, let me just plop an Application Gateway in Azure or API Gateway in AWS. Fuck database clusters. At this point, I haven't even configured an apache/nginx server in a couple of years. Web Apps in Azure are more than fine; and for the rest, K8s.

As a result, good "classic" sysadmins are a dying breed even at the enterprise level. So they're even more rare and less accessible for small/medium-sized businesses. If I go to my IT dept. right now, I can guarantee 80% of them would be completely lost trying to set up and use AD; AAD is just too convenient.

That basically leaves you with: move to cloud, or learn how to do all of these things by yourself. And those things take time (to learn and to manage)

It's like deciding to make apps with Perl. Can you do it? sure. But you'll probably have to do it on your own.


Oh god, quadrupling from £25/month for a small low-traffic project as you have described sounds like daylight robbery!


Sure, but think about this: that little side project now has about 100x the traffic it did when my bill jumped.

That still isn't much traffic (at all) in the grand scheme of things, and Cloudflare's lowest paid tier deals with around 80% of the bandwidth. Still, it's not hard to imagine that bill blowing up to several hundred pounds per month had I chosen not to act. That would translate to several thousand pounds over the course of a year. I don't know very many people for whom such an expenditure, particularly if it's unnecessary and avoidable, would be something they'd regard as insignificant.

Putting it into everyday terms: it could have grown into my second largest outgoing after my mortgage. That doesn't really seem proportionate or sensible, so why wouldn't I look for a better deal?


It might be different if you're a developer in SV, but I'd say for most people, that's not insignificant. That's more expensive than most hobbies, I wouldn't spend £100 on a side project if there's an alternative.


Switched from GCP to my own server hardware. After doing the math it came out that it would pay for itself in less than a year. Depending on individual usecase, cloud can be prohibitively expensive and running a server for a small business really isn't nearly as hard as they'd have you believe


Cloud isn't about cost though. Cloud is about value. You can scale super easy when you inevitably need it (assuming you've "made it" anyway - whatever that means). You get a burst of new users, it's trivial to add additional nodes (again, assuming you've set up your infra to be easily scalable).

With own hardware, scaling is not as easy. You'll have to do a lot more around plumbing too. Networking, security, many other things that you'll have to address. Stuff that has already been solved for you.


That "inevitably" is doing some really heavy lifting in your post. It practically never comes for most companies.


But most companies' dream is to reach that ""inevitable"" point where the cloud saves it; it becomes a bet where they either reach that level of scale, or die trying.


> But most companies' dream is to reach that ""inevitable"" point where the cloud saves it; it becomes a bet where they either reach that level of scale, or die trying.

And how many die trying due to bleeding all the funding to AWS, instead of running everything off a couple cheap boxes underneath the CTOs desk? I've been in at least one which ran out of money that way.

Don't pay for the future pipedream now, pay for what you need. That "inevitable" dream of near-infinite scale up usually never arrives for most companies. If it does, worry about it then.


AWS gives away literally tens of thousands of pounds to start ups. Granted, it's to lock them in down the line but getting free credits gets you further than not having credits.


AWS credits are a toxic trap. BTDT. They'll get you far enough that you are hopefully (from AWS perspective) locked in pretty tight. And then the high monthly costs start to hit you hard.


Sure, that's true. But I've been in companies where they've had major investments and a pretty significant number of users, and they still had stupid amounts of credits. One of my previous companies was running a third of their infra on AWS credits, and they still had a long runway in terms of free credits.

Meanwhile, they had to shake off a £1 million a year contract for the next 5 years for 2 DCs. With AWS they were using less than half of that per year (this includes the credits they had). But even if it wasn't cheaper, requesting a new server took days, not minutes. Scaling was not possible. I'll take the credits and potentially get trapped rather than having to deal with the inflexible mess that is in-house managed infrastructure.

At least bigger orgs are able to afford it by (hopefully) building a cloud on top of their infrastructure, but outside that, the majority of companies should be looking into the cloud. Whether that be AWS, GCP, or smaller clouds like DO, it doesn't matter.


> But I've been in companies where they've had major investments and have a pretty significant number of users, and they still had stupid amounts of credits.

Your companies have been wiser and more frugal than mine!

In every case, I've seen the credits run out before there was a penny of revenue coming in.


Surely if they are that successful, they also have the resources to redesign/upgrade their backend?


Yes, but what about the million new customers you're going to get next week?


Having plans for what to do in case of a success disaster is good. Spending in expectation of having a success disaster can be disastrous.


Funny, but honestly, a single dedicated server should easily be able to handle a million users, for most CRUD apps.


The cloud has normalized terrible underpowered VMs, so many new developers may just not be aware of how much performance a real dedicated machine has - even a relatively "mid range" one (i7 & 16 GB RAM).


> The cloud has normalized terrible underpowered VMs, so many new developers may just not be aware of how much performance a real dedicated machine has

This. I'm now seeing many younger developers with no exposure at all to hardware servers, only underpowered cloud VMs.

Not sure how to solve this, but I certainly encourage everyone to spend time now and then benchmarking their cloud setup vs. local hardware, to at least understand the performance spread and cost tradeoffs you're making.

I've seen too many setups paying 4 and even 5 digit US$ monthly bills to AWS, for workloads that could have been served off a single $1000 box without it breaking a sweat.
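
Even a crude comparison makes the gap obvious. For example, run something like this on both a cloud VM and a dedicated box (sysbench and fio need to be installed; treat the numbers as rough indicators only):

    sysbench cpu --threads="$(nproc)" --time=30 run
    sysbench memory --memory-total-size=10G run
    fio --name=randread --rw=randread --bs=4k --size=4G --numjobs=4 --direct=1 --group_reporting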


This! I benchmarked a whole bunch of different cloud providers for fun (& my bachelor thesis) and was impressed by how badly some cloud VPSes perform. Considering the really steep price to get any kind of significant memory/CPU resources with the major cloud providers, as well as the steep bandwidth charges, this little experiment was eye-opening.


Would be very interested in reading this thesis, my email is in my profile


Could you please share your thesis? Maybe over email?


I appreciate the interest; unfortunately I can't share, as it is under a non-disclosure agreement. But since there seems to be so much interest in the benchmark part, I will see if I find the time to recompile the data and publish it as a blog post.


I'd be interested as well.


Funnily enough, that was a use case for dedicated servers I encountered. Usage-based pricing is great if you have an unlimited budget, but most companies would rather have predictable costs and not serve every traffic spike.

Especially for internal use (CI/CD, analytics), you'd rather queue a few things up than always have to consider your budget when you want to run something.


Doing it right now. Entire company is migrating from AWS to Azure for reasons I can't discuss, and I'm currently tasked with this migration in the team I am in.

Honestly? It's quite fun. Despite considering myself more of a programmer than devops, I really like the devops stuff - as long as I'm doing it as part of a team and I know the domain and the code - and not being that general devops guy who gets dropped into a project, does devops for them and gets pulled into another one to do the same.

Working out all those little details of switching from AWS to Azure is fun and enjoyable, and I also feel like I'm doing something meaningful and impactful. Luckily there's not much vendor lock-in, as most of the stuff lives in k8s - otherwise it would be much trickier.


AWS to Google Cloud. Already mature product (public company). Many potential customers are strongly Amazon-averse. Switching to GCP won some deals that were being lost otherwise.

Anybody's cloud strategy should try and stick to the most basic services/building-blocks possible (containers/vms, storage, databases, queues, load balancers, dns, etc) to facilitate multi-cloud and/or easy switching.

Not that each cloud doesn't have its quirks that you'll have to wrap your head around, but if you go all in with the managed services you're eventually going to have a bad time.


I concur; in my experience the biggest driver of growth for Azure and GCP is that customers of SaaS companies and consulting companies make it a requirement to choose anyone but AWS. Legacy companies are terrified of Amazon.

Google does have some innovative big data products like BigQuery and Dataflow. In general, choosing GCP over AWS shouldn't hinder a company's growth at this point IMO.


>customers of SaaS companies and consulting companies make it a requirement to choose anyone but AWS. Legacy companies are terrified of Amazon.

Is there a particular reason for this?


Amazon sell a lot of stuff and consequently compete with a lot of legacy companies: brick-and-mortar retail in nearly all fields (fashion, food, tools, appliances, books, etc.), other online marketplaces (often with a mail-order past) and even some tech companies.

Generally, companies are unwilling to give money to their competitor, which reinforces this.

They also might want to avoid some PR issues, as using your competitors product can lead to juicy stories depending on the situation/field.

If they are somewhat paranoid, they can also be reluctant to have their data accessible by their competitor (and I'm not talking DB access there; even something like ELB logs can give valuable information - IIRC such a story did pop up a few years ago).

Working for a marketing SaaS solution, requests from our clients to not be hosted on AWS are still quite common.


Apparently Walmart and Target stay away from AWS, but most of it seems to be FUD about, I guess, Amazon potentially targeting their competitors and suppliers by stealing data stored in AWS or using it for strategic business moves. It would be a swift death for AWS if they were found to be doing this, though, so IMO the fear is unfounded.

https://www.forbes.com/sites/andriacheng/2019/07/14/amazon-a...


Netflix is hosted on AWS to name just one such instance. I have heard that internally the customer names are obfuscated, including other Amazon services. Indeed it would be a death sentence to prioritize themselves.


Moved a project with around 600k monthly users from Heroku / Google (split setup) to full AWS setup.

The whole process took around 3 months, from creating the AWS account to the point when all production environments were running on AWS and Heroku was "shut down". There was some planning ahead of this as well, so the actual time varies.

Heroku was a heavily limiting platform (for example, they didn't and still don't support HTTP/2) and we needed more control over our infrastructure to support further growth without paying enormous costs (for example, Redis prices on Heroku are just mind-blowing).

Also, as we were about to open a few new markets, Heroku would have required a lot of manual work to get everything working, something which is really, really simple with Kubernetes.

Our monthly costs did go up vs. what we had at Heroku at that time, but we're getting a lot more control and bang for the buck.

Regarding convincing stakeholders, you really need to have good reasons to do it. These kinds of switches are not cheap nor easy and come with a bunch of risks. The easiest thing to sell is always pricing, but in that case you have to show calculations (big guys like AWS and Google have pretty decent calculators you could use) which show the switch is worth it.
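To illustrate the kind of calculation I mean (the numbers below are made up, not from any real migration), the pitch usually boils down to a simple break-even:

    # break-even for a migration, illustrative numbers only
    current_monthly = 12_000   # current provider bill, USD
    target_monthly = 8_500     # estimated bill after the move, USD
    migration_cost = 40_000    # one-off engineering + overlap costs, USD

    monthly_savings = current_monthly - target_monthly
    breakeven_months = migration_cost / monthly_savings
    print(f"saves {monthly_savings}/month, pays for itself in {breakeven_months:.1f} months")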

As I was moving from a small player (Heroku) to a big player (AWS), I also had other good reasons (better CI, better logging, better performance overview, more control in general). So it really improved a lot of things for the developers, devops and users.


What are you using for compute, EKS or Fargate? We just helped someone switch from Heroku to AWS and they dropped their cloud costs by 67%. This is using ECS + EC2. Fargate is typically 2x more expensive.


Spot/Reserved instances are used.

I probably should have clarified that the extra cost was expected, as we did a lot more in AWS than we could in Heroku; we used the switch to start using a bunch of stuff AWS offers like Lambda functions, CloudFront, RDS etc. Stuff that we just didn't (and couldn't) use on Heroku, and thus didn't pay for.

As the purpose of the switch was to get more control, more features and out of Heroku's "black box", higher costs were expected and perfectly normal.


Is that taking spot/reserved instances into account (for both)? AFAIK Fargate supports both of those.


Without Spot/Reserved incentives. Could get even more if that’s done.


yes.

I've done AWS, Azure, Google.

My basic impression, as a software engineer/site reliability engineer, is that Google >> AWS >> Azure.

This relates to sophistication of offering and design of cloud.

The dominant questions look like this:

- how familiar is it to the infra people?

- can we implement appropriate governance concerns?

- how tightly bound is your important code to the specific cloud?

I have generally focused on Kubernetes in the last 5 years, to allow the service layer to be relatively portable. This is very useful in the switching/migrating question.

My general thought process is not to use cloud services unless it's very obvious (EC2, S3, etc.); I prefer to have k8s services provide that capability and use the cloud provider as portable COTS.


DigitalOcean -> Hetzner Cloud. Simply realized it's much cheaper for my use case (single instance running everything), even doubled my ram from 2GB to 4GB and it was still cheaper. It's also a European company which helps (more trustworthy IMO). Also took the time to simplify my deployment, which was nice.


I was a very early AWS client (first year of EC2). I moved to Azure for better AI/cognitive features, SSO, AD, blabla; also MS being a better open source steward, better docs, better VS Code plugins. I also felt AWS's stance re: the Kibana/Elastic license is/was not aligned with my desire to not live in a dystopian version of the future.

It was painful but Az has improved a lot of the sharp edges I encountered.


I've helped around 900 companies make the switch (to GCP, in my case), and I can confirm based on their results over time that the differences are not as small as they might seem. For so many, it's just ease of use and efficiency to get things done; for others it's attention and partnership; for a third cohort it's absurd cost advantages; and for a fourth it's performance and reliability. Our customers see gaps in one or many of those four areas? They move.


I migrated a previous company's infrastructure from AWS to Fly.io

Our AWS bill was the main reason. It was far higher than it should have been for the traffic we were serving. Even after we'd halved our AWS bill (the original bills had been crazy), it was still kinda high

Fly was a pretty clear choice when we looked at the lower costs and ease of transitioning from single-region to multi-region infrastructure

I'd been nudging the CEO about doing a migration for about a year before we decided to make the move. When I found that I couldn't really get our AWS costs any lower and did a full cost estimate of Fly vs AWS, the wheels moved reasonably quickly

The CEO primarily cared about lowering our monthly costs and being able to do the migration reasonably quickly (~1 month)


I know of several retailers and companies serving retailers that switched away from AWS around the time they bought Whole Foods. Before our company switched away, we had multiple retailers say that they would not use our services hosted by one of their biggest competitors.


Switched from Amazon to Google because I hate Amazon more than Google.

Feature-wise, I'm just as happy. However, I trust Google more, but that probably boils down to my hatred again. :)


Interesting that you trust Google more in this regard. Given Google's terrible history of deprecating products, I would not trust any "real world" business to any of their services.

Also, at some point I was playing with Serverless in both Google and AWS. Google's Serverless examples were broken (Google cloud was returning 500 errors) while the same stuff in AWS worked smoothly. That left me with a bad taste.


> Interesting that you trust Google more in this regard. Given Google's terrible history of deprecating products, I would not trust any "real world" business to any of their services.

That doesn't apply to me, because I have never used any of Google's deprecated products. I assume that's because they haven't (or have they?) deprecated any of their cloud services.

> Also, at some point I was playing with Serverless in both Google and AWS. Google's Serverless examples were broken (Google cloud was returning 500 errors) while the same stuff in AWS worked smoothly. That left me with a bad taste.

Good for you!

I have used Google Cloud Run for more than a year now, and can't be more happy. Never had problems with AWS either, which means that there are at least two cloud providers providing the same-ish service that lots of people can enjoy.


>Dozens of decade+, billion+ customer contracts at this point make shutting GCP down a silly thought.

Really crappy that the demos sucked tho, sorry :(


https://www.aljazeera.com/economy/2020/7/8/google-shut-down-...

https://cloud.google.com/support/docs/shutdown

Also, there have been plenty of Google products with paying customers that Google has shut down.

At this point, Google has zero credibility with me.


> Also, there have been plenty of Google products with paying customers that Google has shut down.

I highly doubt they will shut down a $5.5B business. Either way, because my applications are cloud-ready, it's easy to switch to another cloud provider again.


I've done the same just with a different provider. And I see a trend, people go for the lesser evil out of the ones available.


In startup mode, we switched multiple times, from Azure to Google Cloud to AWS, chasing those startup credits. Things that were easy to move tended to move quickly, but systems that lost owners or priorities didn't tend to get moved all that quickly, which left a fun legacy of a system or two on an old cloud account.

Growing out of that mode has the team mostly focused on a single cloud provider, with a few things that'll remain on alternatives because they're better suited, and projects will clean up the rest in a couple of years.


There have been two circumstances where we’ve seriously thought about it, but as of yet haven’t changed:

1) Under some circumstances we might want to give very stringent uptime guarantees for some systems, and I do not trust providers to have zero global (cross AZ) outages. Having a hot standby or even load balancing across clouds could be tempting.

2) One cloud provider is very keen to get into our sector and has made extremely generous overtures which we’d be silly to completely ignore.

As I say, not something we’ve followed through as yet but both are serious considerations.


My team was migrating 1000+ VMs from AWS to GCP, mostly for cost efficiency (and adoption of k8s for even better cost).

We used Kafka MirrorMaker 1 with two-way sync (the new cluster had separate topics for writes that were synced back to the old cluster, and all topics from the old cluster were synced to the new cluster). For Postgres, the failover switch to the new master required about 1-2 minutes of downtime.

We migrated ~80 microservices within 8 months and now our infra costs about 1/4 of what we paid to AWS; completely worth the effort!
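The Postgres side is conceptually just "wait for the replica to catch up, fence writes, promote" - roughly the sketch below (not our exact scripts, just the core idea; assumes PostgreSQL 12+, active streaming replication and psycopg2, with placeholder hostnames/credentials):

    # minimal sketch of the Postgres promotion step (PostgreSQL 12+, psycopg2)
    # hostnames and credentials are placeholders
    import time
    import psycopg2

    replica = psycopg2.connect(host="replica.newcloud.internal", dbname="app",
                               user="admin", password="...")
    replica.autocommit = True
    cur = replica.cursor()

    # 1. wait until the replica has replayed everything it has received
    #    (assumes streaming replication is active, otherwise receive_lsn is NULL)
    while True:
        cur.execute("SELECT pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()")
        if cur.fetchone()[0]:
            break
        time.sleep(0.5)

    # 2. writes to the old master are paused/fenced here, outside this script

    # 3. promote the replica to be the new primary
    cur.execute("SELECT pg_promote(wait => true)")
    print("replica promoted:", cur.fetchone()[0])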


I switched from Heroku to AWS, then eventually back to Heroku. Heroku to AWS was for cost reasons (cut monthly costs by roughly 35%) but wasn't enough savings to justify hiring a devops person. As soon as there were too many issues I didn't know how to fix (setting up everything Heroku offers was hard and likely done wrong, which made ongoing maintenance some level of hell), I switched back to Heroku where the lack of devops needs basically paid for itself.


This is very, very true. One of our customers described it really well “The nightmare of DevOps kept us from managing AWS directly, but with TinyStacks we can scale a billion+ requests a day in audio advertising leveraging the full power of AWS.”


I would check out Render and DO App Platform. Heroku has actual competition now.


Moved from AWS to vultr. Primary reason is unsafe billing practices at AWS. I'm a small shop and felt very uncomfortable about the potential of getting a surprise $xx,000 bill from AWS. The hack in December tripled my anxiety. Changed my password and migrated in January. Closed out my AWS account once every last service was migrated.

I mostly use only basic services so pretty much any cloud provider can fit my use case. It took some time but I have peace of mind now.


> The hack in December

What are you referring to?


Went from AWS to GCP. Everything we deploy is on Kubernetes, so with a combination of Terraform and ArgoCD it was pretty easy to move. It was pretty much push-button.


Definitely a space where Kube shines!


Yes. We migrated from Rackspace to GCP. When I joined the company, it was clear Rackspace was losing the cloud race, but back in 2012/2013 it was a strong contender.

Sometimes one adopts a technology too early...

We had a longer selection process between AWS, GCP and Azure. AWS was difficult because some of our customers see Amazon as a competitor. However today we also offer the option to run on AWS. GCP won over Azure.


Yes, switched from AWS to Google to Azure. Don't switch to Azure unless your employer forces you, you will regret it. Google is great on a technical level though, especially if you do things with Kubernetes.


> Don't switch to Azure unless your employer forces you, you will regret it.

Can you give any details? Pricing, reliability, weird quirks you have to program around, ...?


Many operations are INCREDIBLY slow. Creating a VM. Mounting a disk to a VM. All take easily 10-30x longer than I'm used to with other clouds. This is especially annoying with Kubernetes as it likes to move disks between VMs which... takes a while on Azure.

Many, many backwards-incompatible changes in their Kubernetes platform. I've had to recreate clusters about twice a year so far. Lately it's been better since they finally got node pools working (about 2 years after all of their competitors).

They can't get their network stable. Things like `kubectl port-forward` or `kubectl logs` hang after 4-30 minutes of inactivity (i.e. the tunnel is open but no packets actively being sent) which, according to Azure, is "working as expected". This makes the tooling utterly unusable. It has to do with the way Azure's load balancers deal with idle TCP connections.
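For anyone hitting the same thing: the generic workaround for load balancers that silently drop idle TCP connections is to enable keepalives on the client side (or raise the LB idle timeout). Just to show which knobs are involved, here is a minimal Python sketch at the raw socket level on Linux - kubectl itself obviously needs its own configuration, and the hostname below is a placeholder:

    import socket

    # open a long-lived connection and ask the kernel to send keepalive probes,
    # so an idle-timeout load balancer sees periodic traffic
    # (TCP_KEEP* constants are Linux-specific)
    sock = socket.create_connection(("api.example.internal", 443))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before giving up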

Also, their support engineers are unwilling to help you unless you run Windows. They always insist on remoting into your machine using some Windows utility, even though the issue is with their cloud instead of my machine.


> Also, their support engineers are unwilling to help you unless you run Windows. They always insist on remoting into your machine using some Windows utility, even though the issue is with their cloud instead of my machine.

That last point sounds like enough of a reason to never touch them with a barge pole.


In my experience a lot of difficulty with Azure arises because a business is probably already using Azure to manage its IT infrastructure.

You then get into sticky situations where IT are unhappy handing over admin access to the Ops teams: for example refusing access to AAD because it is integrated with the _corporate_ domain.

This might seem fine on the surface, it's just a people issue, but it can become very tiresome when dealing with Azure resources and their permissions.

Worse still, some Azure resources use AAD almost like a data store, such as B2B and B2C. If you have write-back enabled on your AAD (as most companies would - otherwise users wouldn't be able to self-service forgotten passwords) you will _apparently_ clog up the on-premise domain with foreign objects from your B2B configuration.

Of course you can get around all of this by having a separate tenant in Azure for SaaS teams/Ops only. But you introduce the headache of management (should both tenants be under a super enterprise tenant?) and single sign on (security might say there is only one user login for everything with RBAC & MFA managed from one place... now you have to join the tenants somehow...)


I'm pretty sure that B2C is effectively a stand-alone instance of AAD that isn't integrated automatically with other applications and on-premise WSAD.


Heh, this reminds me of the mess with separate Microsoft Windows, Outlook and Skype personal accounts...


The biggest problems: things don't work, things don't compose, and support is unknowledgeable & ineffective.

Things don't work: outages, bugs. So many I'm not sure where to start.

ACR: slow as molasses. I don't know why they didn't build it on blobstore. Many queries are clearly O(images) even when they shouldn't be. Throughput is terrible and made worse by inane pagination; listing images, for example, has a throughput of something like 60kbps. And "b", as in bits. Minutes for 3MB of data. It's absurd; they think that's "working as expected".

AKS: they manage the API server in AKS, and we find it is routinely non-responsive. We went through a quarter-long support ticket, which went back and forth between "you're putting too much load on the API server" -> "we don't think we are, what load is there?" -> "here's the top queries" (and they're all queries from, like, the cluster controllers - which are also managed by AKS/Azure) -> "well there's too much load!".

App Gateway: normally stable, but had an outage when Let's Encrypt's old root expired. (We were using a cross-signed cert - i.e., our cert was valid, but App Gateway failed it, i.e., a false negative validation.) They never acknowledged the outage, and the support ticket we filed didn't get a response until days later - missing support's SLA - meanwhile, some engineer somewhere clearly fixed the service, as it started working all on its own.

ACR again: we used to get 500s. IDK if these still happen or not, as we retry-looped most of the spots that were hitting them. Support's response was ridiculous: "what's your ISP?" "… this is inside your network, Azure, and my ISP wouldn't cause you to serve a 500." "Maybe the image is too big" "The image is ~100 KiB" …

Plus global AAD outages, the status page going down while the Twitter account is like "check the status page", the Portal has issues (a few days ago listing subscriptions just returned 0 rows; like, okay, I guess I'm not looking at those today), and the activity log will occasionally just return errors, or, like today, return 0 entries where I knew there to be entries.

Composability: a cloud provider's job is to offer bricks from which I can build whatever infra my company's needs demand. But Azure constantly says "well, no, you can't put those bricks together that way." IPv6 anywhere in your vnet? You can't add a managed PostgreSQL server to that vnet, because not only does it not support IPv6, you can't even add it to an IPv4-only subnet in a dual-stack vnet. Like, the entire point of dual-stack… Also, when we attempted this, the API request took 2 hours to fail, and failed with "Internal error, please retry", which we then did, like chumps. 4 hours later, support ticket.

AKS will add new features (that ought to have been there from day 1) like nodepools… but only to newly created clusters. You want to take advantage of that? Too bad! Recreate that cluster from scratch!

Support: Azure has no meaningful tooling for handling bugs in their services. The only hammer they have is support, and by god it's going to hit that nail. Support might (if you can get them to admit that yes, shit's broken) field a bug report, but then the support ticket is closed. Is the bug fixed? How will you know? IDK.

Also, AFAICT, certain products you just can't open a support ticket for; notably, the Portal. It isn't an "Azure service", so it isn't in the list of things to select from. Also, they override the mouse wheel on the list of services, so scrolling ~1 "detent" on a trackpad results in the list scrolling at Mach 4. Support tickets lack URLs, so they're unlinkable. Occasionally whatever the agents use to view info on tickets gets desynced from the ticket, and new replies in the portal are black-holed (but email replies still work). You can't put ">" in a support ticket, it's not allowed. You can't upload certain file formats, it's not allowed. (E.g., want to send video of a bug? Not today!)

SLAs are regularly missed, and the responses often ask for information included in the opening ticket. And… the agents' grasp of English is frankly terrible. (I'd accuse them of out-sourcing it, but we once had an agent on a ticket go dark on us… because he was in Texas when Texas lost its power.)

Honorable mention: My God, AAD is trying to be as complex as can be. Apps, Service Principals (which are needlessly and confusingly just called "Enterprise Apps" in the UI! — oh and the search box for that page doesn't work), Roles, Permissions, Role Assignments, Tenant, Subscriptions … oh my God. Like, AWS IAM is frankly a terrible implementation, but Azure AAD makes AWS IAM look amazing.

My entire 2+ year experience with Azure has made me an ardent believer in AWS, and willing to try GCP.

I need to write this all in a blog post.


> I need to write this all in a blog post.

Please do. It would be very valuable to find these things when searching about Azure, because this absolutely matches my experience.


I had a company that wanted to move from AWS to GCP. This was a top-down decision, made with an incompetent "tech lead" stating that we could move everything and save money with the 10-year, hundred-million-dollar commit contract with Google.

It failed, horrendously. Even though multiple people in the organization were calling out how bad an idea it was, they still moved forward. Google has some niceties with how things connect, setup, etc., but at the end of the day they are a cloud provider and not everything they provide is a silver bullet.

The project was canceled after 3 years, after spending millions on migrating, since the migration was not a drop-in replacement (no one thought it was except the "tech lead").

There are a lot of things besides tech that can affect these projects. If you hired AWS experts and they need to be AWS experts, expect to have to hire GCP experts too.


I once worked with a CTO who decided to move from AWS to GCP (and also move from Spark to Hive and python to scala, at the same time). That guy was an idiot.

(At one point somebody accidentally spent £30,000 in data transfer costs with one key press.)

The project was never completed and the CTO just moved on to another fancy CTO position.


Why on earth would you go from spark to hive?


At work we moved from AWS to GCP for pricing reasons. We are still paying loads more than before moving to the cloud, but it's hard to find good sysadmins nowadays who don't want to do everything in the cloud. For personal projects I've moved things to Linode and Digital Ocean, as they provide quite decent value. For a comparison of AWS/GCP/Azure/Linode/DO/Tencent/Ali value & performance, check out the extensive comparison I ran recently: http://blogs.perl.org/users/dimitrios_kechagias/2022/03/clou...


What is the difference in expense between the "loads more" you are spending and hiring and managing a sysadmin? Ballpark is fine, interested to know if it is 1x,10x, 50x?


So, we went from a cost of a bit over $3k/month for infrastructure to almost $15k/month (it would be more with AWS). Yes, with your own hardware you have to buy the servers yourself at extra cost, but on Google most people run instances on Haswell/Skylake, which is >5-year-old stuff, so a quick calculation for buying top-of-the-line hardware every 5 years (which for the first couple of years at least gives you faster hardware than on GCP) comes to about an extra $2-3k/month.

Also, we went from 1 (very good) sysadmin plus 1 other developer who assisted with part-time sysadmining, to a cloud that has some extra bells and whistles but requires 2 full-time sysadmins (we are temporarily left with 1 and he can't keep up). Part of the problem is that the whole architecture was very fine-tuned to run on our customized rack servers (it had been doing so happily for almost 15 years), so a lot of things did not translate that easily to the cloud. I guess if you have a small system, or one designed from scratch for the cloud, a single cloud platform person might be enough.

Overall, there are some extra disadvantages beyond cost - e.g. Cloud SQL manages some things for you, but there is more lag compared to when our super fast DB server was on the same rack as the application servers, and other such little performance things that we could fine-tune when we had control of the hardware.


Yes. Huge AWS -> GCP migration.

Why? Incoming CTO signed a massive GCP deal probably because it was marginally cheaper than AWS (while probably ignoring the migration costs).


Same thing happened to my old company (a large insurance company). Moved from Azure to AWS + Terraform/Kubernetes because of cost and the "cloud independent" nature of Terraform. The whole IT department spent 2+ years moving hundreds of (relatively modern + legacy) services and applications. Some services couldn't move because they were managed by a third party. I am pretty sure they didn't factor in the cost of migration.


Always look for kickbacks if there are cases where the company ends up being out more money all-in. These decisions should not have been made in this way, and it isn't always incompetence that drives this.


Yes, moved away from rackspace to IBM Cloud. Since then use several cloud providers for redundancy.

Every provider has severe downtime (when even phone lines are not operational), so we do failover across several providers. It has saved a lot of uptime for us.

Also, we (almost) do not use vendor-specific solutions. Almost everything your cloud provider upsells to you can be achieved without the provider lock-in. It will save time later when provider quality goes to sh*t (it eventually happens) and you have to migrate your infra somewhere else.


Consolidated all of my small cloud VMs from DigitalOcean, GCP, and AWS into a single dedicated server on Hetzner. It costs way less and I don't have to pay for egress anymore.


> If so why?

Company policy change. We went from each office (or even department) basically doing their own thing and having their own billing accounts and negotiating (or not) their own deals, to one central cloud deal with central billing and administration.

The change from everybody doing their own thing to having a central devops strategy we all had to work within was a much bigger change than the actual changing of cloud providers.


We decided on a multi-cloud infrastructure. Now the team is thinking about choosing a tool that would allow you to migrate and do multi-cloud in two clicks. We try to go with not-very-large clouds because, first, we want to support small businesses, and second, sometimes it is more convenient for regional legislation. The main clouds are Scaleway, DigitalOcean and our own servers.


I'd be surprised if more small businesses hadn't - the credits sales people are willing to give you are massive.


Yea, that is also my experience working mostly in startups. I wonder if they will keep up this credit strategy.


I've bounced between AWS, Azure, and GCP. Once they get entrenched enough the credits stop but right now it's GCP with the best incentives.

Also, don't sleep on Oracle. Their cloud platform is stupidly competitive price-wise but limited feature-wise. If you're just looking for basic compute and storage they can't be beat, sans credits.


Good point with Oracle. As much as I don't like their company, the pricing for compute is significantly lower than AWS's (didn't compare others).


Amazon EC2 to OVH AI Docker. Price for GPU instances went down 80%.


> to OVH AI Docker.

Can't find an OVH product by that name (would have surprised me), is this a buzzword bingo joke?


I searched for exactly those 3 words on Duckduckgo.com and got https://docs.ovh.com/us/en/publiccloud/ai/training/build-use... which is a tutorial for what I use.


Docker and AI are technologies used on the OVH host, it's not part of the product itself...


Yes and no. OVH has dedicated instances for AI which are different from the regular cloud instances and you don't get a root VM but only root inside docker.


Probably OVHcloud AI tools. They're docker based.

https://docs.ovh.com/us/en/publiccloud/ai/


Moved a SaaS tool from Linode to DigitalOcean. The move was security-driven: DO's K8s volumes are LUKS-encrypted at rest by default; one less thing in your security controls to worry about. Prices are higher than Linode's, and we have had some occasional reliability issues with the LoadBalancer, but for the most part it works really well.

The SaaS tool was mostly cloud-agnostic, so the changeover was not terrible: change the deployments to use DO's CSI storage, set up secrets, deploy services. I stood up the entire infra on DO, then moved the DNS over one subdomain at a time. Took about 4 days to make the move, including validating everything and finally cutting off Linode totally.
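For the DNS part, the usual trick is to lower the TTL ahead of time, repoint one record, and wait until resolvers return the new address before moving on to the next subdomain. A rough sketch of the kind of check involved (uses dnspython; the hostname and IP are made-up examples, not the real ones):

    # sketch: poll until a subdomain resolves to the new load balancer IP
    import time
    import dns.resolver

    NEW_LB_IP = "203.0.113.10"  # placeholder for the new provider's LB address

    def cut_over_done(name: str) -> bool:
        answers = dns.resolver.resolve(name, "A")
        return all(rr.address == NEW_LB_IP for rr in answers)

    while not cut_over_done("api.example.com"):
        time.sleep(60)
    print("api.example.com now points at the new provider")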


Interesting to hear. A friend also had a lot of trouble with Linode reliability and moved to a bigcorp after that. Running a multiplayer game, even turn-based, had huge hiccups and the problem simply didn't reproduce on the new systems. Iirc support couldn't do anything but I'd have to ask for the details.

He spent a lot of time debugging it with a minimal example to rule out other causes, iirc a websocket pinging every few seconds on their kubernetes offering (again, I'd have to ask for the details), and it reproduced on Linode but not on the platform they were considering moving to (with a similar hosted kubernetes offering).


Yes from Linode to Hetzner [1]

Mainly because of price: more CPU/RAM/Storage for a lower price.

I think my previous server was underpowered, because it kept swapping. Now it runs as smooth as it can (it never swaps). Migrating is a bit of a hassle though and things might not work as you expect [2]

[1] https://j11g.com/2021/12/28/migrating-a-lamp-vps/

[2] https://j11g.com/2022/01/03/bypassing-hetzner-mail-port-bloc...


I've done so at two companies (without naming names). Not entirely, but enough to have a bargaining chip with AWS. At the largest company, our monthly bill was at least $1M at AWS, so the savings more than made up for the engineering time.

Also, in both cases, it was moving from AWS to GCP, and in both cases we were using Kubernetes and not really using much of the provided services of the platforms. I suspect this is the biggest reason for Google to push Kubernetes; abstract away compute so it's easier to switch.


New CTO came into my previous place and forced a switch from AWS to GCP. It took 3 years and an enormous amount of effort just to help the bottom line a small amount. Totally ridiculous.


Yes, I changed from AWS to Oracle, and from CapRover to k8s, because of the Oracle Cloud Always Free resources; it saves me $20/month and gives me 20 times more RAM and 6 times more storage.

I wrote a blog post before the implementation: [move to k3s cluster](https://tim.bai.uno/how-to-deploy-a-django-application-to-a-...)


From OVH to AWS. 30% TCO reduction. This was before the fire.


I’m really curious how this math adds up given how inexpensive OVH is.

Other people in this exact thread are saying they experienced an 80% cost reduction by moving from AWS to OVH.

https://news.ycombinator.com/item?id=30943058


One phrase: right-sizing. The original infra on OVH used an instance type that was suboptimal for the workload. Because re-provisioning was not an option due to the size of the cluster (especially without moving hard drives), we had only one option: move the data to S3, find the right instance type and switch over. The ability to right-size your infra is grossly underrated. Decoupling data from compute too. We are talking about hundreds of nodes.


Would you mind sharing what size instance types you were on at OVH and are on now at AWS?

I ask because I've found that even radically larger sized OVH hardware is still way less expensive than AWS.


At the time (many years ago) the largest instance type of OVH was used because it had the disk capacity that the customer needed. On AWS, S3 has a vastly different cost structure, and the fact that you can use whatever instance type fits your workload enabled us to save this much. We could use some low-spec m5.* instance type running at 80% CPU utilization for the workload, and we could also use fewer instances.

To re-iterate:

- the most cost saving comes from the fact that we could de-couple storage from compute

- the second part of the cost saving came from the fact that we could use fewer instances with lower spec


So you saved money because you optimized the structure, not because of the provider? I guess you would've saved more rebuilding it on OVH?


Reread what I wrote. It could not have been rebuilt, because for the rebuild you would need to duplicate the stored data, which was 1PB+.


We moved a load of stuff from an acquisition from Azure to AWS only because it's then under one cost centre, so we don't have to pay or justify as many invoices.

After futzing with this stuff for years though I really would only use the IaaS options in clouds if you want to consider portability. Network, storage, compute and nothing else. The neutral abstraction is Linux for me these days, not a specific vendor!


We switched 3 times with 0 downtime. Linode > Google > Azure > AWS. Chasing that sweet startup package at each place. We stuck with AWS.


Didn't, but went multi-cloud for performance reasons (we have to be close to the people we're talking to). The most annoying part was the network topology. Used DirectConnect with a provider so that x-cloud latency was as low as it could be. Overall, not too upset. Used terraform to configure things across clouds. Without that, would be too hard to keep track of everything.


Yes. I built an entire new platform on Google Cloud using Kubernetes (GKE). We moved from AWS.

It was a chance to re-architect the platform, make things simpler and cost effective. Huge success on all of those. Costs were reduced by millions a year.

Other than GKE, there was not a significant technology reason to move to Google Cloud. AWS didn't even have a managed version of Kubernetes at the time.


I have a VM running on GCP. I have the world's simplest use case: how do you export the instance as a snapshot so that you can run it somewhere else, say a VirtualBox environment or another cloud provider? How do you export a snapshot of your disk and VM?

Good luck searching for a solution. I just spent 2 hours trying to figure this out and it seems impossible :(


The general answer nowadays is "you don't": you spin up a new VM and provision it using whatever IaC/scripting solution works for you.

Edit to add: even if this were possible, it would be made extra challenging by the cloud-provider-specific agents that are installed in VMs.


As long as you keep on using common services that are available everywhere-ish (=virtual machines, NFS file share, managed mysql/pgsql, managed kubernetes, DNS) it's relatively painless to switch if you're using Terraform.

The worst problem in my experience is all the stuff that creeps up on you over time and assumes hardcoded IPs and service names.
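One cheap mitigation is to scan for hardcoded addresses before (and periodically after) a migration. Even something as crude as the sketch below (Python, naive IPv4 regex, plenty of false positives) catches most of it:

    # crude scan for hardcoded IPv4 addresses in a source tree
    import re
    from pathlib import Path

    IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    EXTS = {".py", ".tf", ".yaml", ".yml", ".json", ".conf"}

    for path in Path(".").rglob("*"):
        if path.is_file() and path.suffix in EXTS:
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for n, line in enumerate(text.splitlines(), 1):
                if IPV4.search(line):
                    print(f"{path}:{n}: {line.strip()}")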


In OpenNebula we're working on trying to ease these use cases by adding an extra layer of abstraction, so you can migrate load as easily as this: https://www.youtube.com/watch?v=IopA_deQK4M&t=1s


Yes, I moved a decent sized project from AWS to Digital Ocean. The biggest hurdles I had to deal with were the things Digital Ocean did NOT support. It was worth it in the end, because our bill went from millions of dollars a month to a few hundred thousand a month. Estimated labor cost was about a million.


Would love to learn more



We did, a few times. Only because many of them had generous credits for early stage startups. GCP, Azure, SoftLayer (now IBM Cloud), DigitalOcean. We weren't relying on any particular service except managed Kubernetes, but we were happy to set it up ourselves if it wasn't on offer.


I ported a client from OVH to AWS with zero downtime. Built new infrastructure with terraform, flipped dns. Was easier than wading through the existing architecture fixing problems and security vulnerabilities.

It was a single web app though. Not too complicated.


I've done a few zero downtime migrations for clients, mostly due to the common story of the credits they started out on coming to an end.

More often than not they realised a super dynamic cloud infrastructure was completely overkill for their business needs.


2016. AWS to GCP. Mostly cost reasons and AWS had some annoyances at the time.

We also replaced our ~4y old very manual setup with a ground-up rewrite with terraform.

Zero regrets, but we didn't have a lot of vendor-lock in with <100 VMs and S3 + a couple of SSL things.


AWS to GCP.

Google salespeople made it worth the company's while, and it was a good time for us to refactor the orchestration stack. BigQuery is pretty sweet.

Services were done one at a time with a VPN holding the two together till everything was migrated over.


Yup moved from AWS --> GCP --> Oracle

Mostly because of cost mitigation.


Are you happy with Oracle? I wanted to try it, but the onboarding process (until you're able to actually order some beefy servers) was so problematic (delays, weird error messages, conflicting messages from support) that I didn't dare put production servers there.


Yup. Pretty much no complaints. We do have a very boring setup though YMMV


Yes, I moved from AWS to GCP while a product was in a late release cycle.

I talk about it in detail in a google cloud podcast: https://www.gcppodcast.com/post/episode-265-sharkmob-games-w...

The primary reasons were: Ease of Use, Support and Cost (in that order).

I had a bunch of what I call "3am topics" which inhibited our ability to perform stressful operations in the middle of the night, meaning we minimised our chances of a successful outcome when on-call... I'm not a fan of that.

I also argued (quite successfully) the case that AWS was not saving us money or much time when the alternative was renting a handful of servers from a provider.

There were attempts by AWS staff to lock us into the platform but those services (cloudfront and ECS being large ones) worked so poorly that caching/reconciliation layers were added into the product to build resiliency: all that was needed was to replace what populated the cache with something else (eventually we moved to Kubernetes which worked much better).

Cloudformation was so hard to work with (at least our implementation) that replacing it with terraform was easy: the hard part was understanding what was needed and what was fluff.

We also had to care a _lot_ about how the network was setup in AWS, there were issues with MTUs not being aligned by default in some cases for example so we had to write workarounds, and the VPCs being zonal by default meant we had much more complex setups.

Another ease-of-use topic was the sizing of instances: instead of specifically saying what shape of machine best matches your workload, in GCP you allocate a number of cores and you're kinda done. Another was that discounts are provided retroactively via sustained use (though they do have commitments too), whereas in AWS you needed to very carefully carve up your requirements to get significant discounts, or write your application to be as stateless as possible (which you can do in GCP too). That is not a lot better than physical machines, because the upfront work of trying to capacity plan is still there... at least if you care about cost.

Regardless, our dev cycles are much more streamlined now, it's rather easy to deploy an entirely new environment, the operations can be handled by a single individual on a part-time basis, which in my opinion is the point of a cloud provider: to save you time.

I can go into much more detail if you have any specific questions.


> Cloudformation was so hard to work with (at least our implementation) that replacing it with terraform was easy: the hard part was understanding what was needed and what was fluff.

I hated CloudFormation when I first started using it; the documentation sucks for getting started. Once you get the hang of it, the docs are great and it's actually really simple. I now quite like it.


Switched from heroku to a vultr VPS, and AWS S3/Cloudfront -> Backblaze B2 + Bunny CDN.

Saved myself a boatload and don’t regret it for a second.


I once had to move 30 PB of satellite imagery from AWS to GCP. Google gave us all these tools to do it; none of them worked. It was rough.


I'd love to hear any tricks for avoiding punitive egress bandwidth charges needed to get your data out of AWS.


I have done a few zero downtime migrations. It is not rocket science if you do not use cloud-specific products.


For kubernetes, we prefer GCP but have to also support Azure because Microsoft is also our business partner.


I went from Digital Ocean to Amazon Web Services, mainly to get familiar and train on the platform.


We have to follow GDPR-like laws, and in our country providers aren't very reliable ("hey, we have a maintenance tomorrow with a 30 minute downtime, sorry"), so we migrate between providers a few times per year. There are always at least two providers we use at any given time (the exact configuration changes every year), one in standby mode with replication between the two, and various homegrown tools to make the switch instant. Sometimes our application can enter a critical state (hardware errors or a bug in the application), so we switch to a different instance and investigate later without hurry. This complex setup was dictated by the stakeholders after we had a series of painful downtimes in 2019. There have also been cases where we migrated due to pricing changes. For the US region we use AWS and it is the most stable of all; we have zero switches there. Our platform is therefore mostly provider-agnostic.


Yes. I got a new job at a new company.


credits

and sometimes service reliability


I have. I was working for a company that had its own data centers, but we were running out of space and time to add more. We were already using Rackspace as a managed hosting provider (we didn't want to manage our own infrastructure), so we decided to move from our data center to Rackspace's data center.

The difference in terms of services was negligible, because they all offered almost identical services. But we did it because:

* We were looking for someone to manage the hardware and infrastructure for us.

* Rackspace's managed hardware offered higher availability than what we were able to achieve on our own.

* We had a relationship with Rackspace and they understood our needs, so we felt comfortable switching over entirely.


Sounds really interesting. I don't understand why so many companies think the alternative to owning your DC must be the cloud. I think the step you describe is where companies can get rid of a lot of overhead and save massive amounts of money. Migrating to the cloud tends to do the opposite.


Interesting - we moved the other direction (to Linode) from Rackspace because their VM/dedicated pricing was just out of whack. Our new box was nominally the same as the old at half the price, but the "4 CPU" was so much newer that it was nearly 2x.


Whenever I've interacted with Rackspace they've seemed almost proud to be expensive.


At the same scale? I guess if you move a whole data center, they'll quote very different prices from their public ones.


Yeah, I suspect Rackspace is aimed at "we'll datacenter for you" and not toward small businesses with small needs. And they price themselves accordingly to discourage those customers.



