AWS staff spending 'much of their time' optimizing customers' clouds (theregister.com)
254 points by MISTERJerk2U on April 17, 2023 | 258 comments



This is more of a side effect of providing self-support solutions, and a trend in Customer Support departments. Amazon Support has been on top of the lists and awards for best support because they have proper KPIs that incentivize customer retention/satisfaction over just the number of contacts resolved.

So when you have good support/sales staff and most contacts are solved with self-service, what remains are contacts about complicated cases and inquiries.

I imagine that they always had a lot of contacts about optimizing spending, and policies to help and retain those customers accordingly, and now the percentage of time spent on those is higher, probably because the other types of contacts are now self-service.


AWS doesn't even have a way to report a bug... what "on top of the list" are you talking about? Who else is on that list?

AWS support is atrocious, unless you cozy up to them somehow to get special treatment. If you are a "regular" customer, you get zilch. AWS offers the best performance for the best price; that, and being the largest service provider, is what makes them succeed. They don't need to spend anything on support -- it's not cost effective.

My personal experience: we provided a service through AWS marketplace related to storage. So, we are not your "regular" customer, had to pay them some for accreditation... and here's the thing: it's probably something about how we create EBS snapshots, but about once every thousand or so snapshots, creation stalls. Doesn't fail, doesn't succeed, just hangs forever. There's nobody to talk to in AWS about this. People who "certified" us (what a joke...) cannot do anything about this problem.

And there's more stuff like that. Sometimes some resources won't delete first 1-2-3 times you try. Some other resources are created, but don't work once in a blue moon. And all you can do about it is just curse and try again.
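For what it's worth, the "hangs forever" and "curse and try again" cases above are usually handled on the client side with a deadline plus bounded retries. A minimal sketch, assuming nothing about AWS itself: `start` and `is_done` are hypothetical callables standing in for whatever SDK calls you actually use, and the timeout and backoff values are made up.

```python
import time


class OperationStalled(Exception):
    """Raised when an operation neither succeeds nor fails before the deadline."""


def wait_until_done(is_done, timeout_s, poll_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_done():
            return
        sleep(poll_s)
    raise OperationStalled(f"not done after {timeout_s}s")


def run_with_retries(start, is_done, attempts=3, timeout_s=600.0, backoff_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Start an operation, wait for it with a deadline, and retry if it stalls.

    start() is a hypothetical callable that kicks off the operation and returns
    a handle; is_done(handle) reports completion. Returns the number of the
    attempt that succeeded, or raises OperationStalled after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        handle = start()
        try:
            wait_until_done(lambda: is_done(handle), timeout_s,
                            clock=clock, sleep=sleep)
            return attempt
        except OperationStalled:
            if attempt == attempts:
                raise
            # Exponential backoff between attempts: backoff_s, 2*backoff_s, ...
            sleep(backoff_s * 2 ** (attempt - 1))
```

This only papers over the symptom: a stalled attempt may leave a half-created resource behind, so real code would also need to clean up or tag the abandoned handle.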


> AWS doesn't even have a way to report a bug... what "on top of the list" are you talking about? Who else is on that list?

You gotta love these assertions at the top of a post; they flag people who confidently show they have no idea what they're talking about.

You can report bugs via your solution architect, via support, or via the Re:Post bug report tool.

> AWS support is atrocious, unless you cozy up to them somehow to get special treatment. If you are a "regular" customer, you get zilch. AWS offers the best performance for the best price; that, and being the largest service provider, is what makes them succeed. They don't need to spend anything on support -- it's not cost effective.

Even basic support is great, I've had most of my issues fixed within 8 hours. If you pay for business or enterprise support you'll have a dedicated person you deal with. Try comparing that to GCP...

> My personal experience: we provided a service through AWS marketplace related to storage. So, we are not your "regular" customer, had to pay them some for accreditation... and here's the thing: it's probably something about how we create EBS snapshots, but about once every thousand or so snapshots, creation stalls. Doesn't fail, doesn't succeed, just hangs forever. There's nobody to talk to in AWS about this. People who "certified" us (what a joke...) cannot do anything about this problem.

If you were a solution partner you would have had a partner solution architect who would take point on this problem for you. This sounds like your organisation didn't accurately share access to this resource with the right people.

> And there's more stuff like that. Sometimes some resources won't delete first 1-2-3 times you try. Some other resources are created, but don't work once in a blue moon. And all you can do about it is just curse and try again.

I've been AWS facing for 11 years and I can count on one hand the times this has happened. Did you file a support ticket?

Source: Work in an AWS consultancy, before that was running tech for a startup on AWS


> You can report bugs via your solution architect, via support, or via the Re:Post bug report tool.

My solution architect? Do you realize that most AWS customers don't have an assigned solution architect? Also, yes. We did talk to our solution architect. That's the part where "what a joke" comes in. Solution architect owes you nothing. Not even an acknowledgement that the development team on the other end received the information. You will never see the ticket in any kind of a bug tracker, no indication that any work on your problem has been done. Like I said, zilch.

Support in AWS doesn't exist. That is not a thing. There's support for financial issues. There's no such thing as tech support in AWS.

Re:Post is dead. Nobody ever replies there. No developer within AWS has a duty to look into that board. It looks like it's mostly intended for customers to help each other out. It's more like a place you can go to vent, if you feel very frustrated. Nobody owes you any follow up, nobody notifies you if things are fixed. It's not a substitute for bug tracker or customer support.


Every year, managing 4000-5000 instances, at least a handful will fail to transition states for me. A support ticket usually clears it up.


> I can count on one hand the times this has happened

You must never have worked with CloudFormation/CDK. What a shitshow. If you need to do anything non-trivial you are basically guaranteed to need a whole ream of dirty hacks and workarounds.

To be fair, that’s during creation. I’ve never had resources not work after creation.


I work with CloudFormation scripts extremely often in my line of work and while confusing at first they're extremely clean once you wrap your head around them. Like any tool or language there are multiple ways to approach issues, your solution can be clean and efficient or "a ream of dirty hacks", it really depends on how familiar you are with the tools.


CDK itself relies on a good number of hacks and workarounds, some of which are not well-vetted at all in my experience. Some of the workarounds are really a symptom in my opinion of certain resources not being part of CloudFormation itself, when if they were, it feels like it’d be almost entirely a non-issue.

It’s a little odd that at this point in the game, the prospect of “make a thing that depends on another thing” is still not an entirely solved problem within AWS.


It's also worth noting that AWS has support reps split out based on your business sector. Working in an education-related space, we have an edtech rep.


For what it's worth I've worked at multiple companies where we had an AWS technical account rep in our slack channels helping us solve problems.

Agreed that their general support (and some of the design and implementation of their services) makes me want to give up and go outside and touch grass instead of working in tech, but as soon as you reach the level where you are in contact with support directly, they're very responsive. It's a weird dichotomy.


A "dichotomy" that is explained by big bucks is not a dichotomy but mere business strategy on AWS's part.


> helping us solve problems.

That is a different thing. We don't need AWS to help solve our problems. We need them to fix theirs and give us feedback on how that's going.


> AWS doesn’t even have a way to report a bug…

Customers are able to report literally any issue at all via support. AWS support is a huge organization. On my service team, there were plenty of instances where bugs were reported by customers.


That sounds pretty nice compared to the GCP world where every issue is your issue and you need to pay for support or forget it. Last time they fucked up, I had to raise the issue in a public bug tracker to which I got one of those "not a bug, wontfix, your problem, goodbye" kind of responses.


Whenever I've needed AWS support it was just as awful as GCP. For example, during a 12 hour Aurora outage, support was only replying with a generic "waiting for engineering team" answer, and the SLA refund was only about $15, although we ended up paying for all the replicas we tried to spin up from backups that wouldn't start. Architect and account manager didn't give a damn about our issue.


The way to get good support from AWS and Google is the same: find someone on your team who is ex-[AMZN|GOOG] and get them to email their buds.


Or, if you purchase consulting hours from any of the major firms, get ahold of their sales team and have them run the ladder for you. A GTM partner is generally more than happy to leverage their firm's relationship with the public cloud providers in the service of their relationship with you.


Not sure that paying for support is even that helpful. I was responsible for a few tens of millions of dollars in client GCP spend (admittedly, a small fraction of client AWS and Azure spend), and I only spoke to any official GCP reps once -- and they were from a recent acquisition. Our GCP team had closer contacts, of course, but as far as my experience went Google compared unfavorably to both AWS and Microsoft, who were happy to throw SMEs at me anytime I had a question. (Even so, there's a lot I do like about GCP, even if I'm a bit uncomfortable with its long-term prospects.)


Where is the bug tracker? What are all these "there is huge customer support" claims and no evidence?

No. You cannot submit a ticket and wait for developers to process it and get any feedback. You can either get an assigned person (which comes with extra pay and only for customers who are deemed worthy) who will not deal with issues like working with AWS bugs -- their goal is to tell you how to use AWS, but if something in AWS doesn't work, it's out of their hands.

Or you can get assistance with payments. Nice, but not what I was talking about.


We are nowhere near being AWS' top customer, and have had absolutely incredible service from them.

Not only that, but we were proactively contacted by an AWS account team before we had even spent our first $100, and they spent many days providing guidance and support to our team before we even breached $1,000/month.

It may not be the norm, but my experience with Azure has been exactly the opposite. After having an (admittedly small) Azure presence for 4 years now, I'm going through the "front door" of their sales site to try to get an account person from them. So far, I recommend AWS over anyone else 100 times out of 100, 10 days a week.


And did you submit a bug ticket? I bet you didn't. So what's this about?


AWS has some of the best support out there. You don't even have to 'cozy up'.

It sounds like you didn't use AWS support but you got AWS certified consultants.


Some of the higher tiers of AWS enterprise support are INCREDIBLE. They're close to the level of support I'm contracted to provide as a full time IC. Stuff like turnaround within hours? 30 minutes for critical issues? It just sounds like they're not a huge account.


Their point was, if you’re a small account you get terrible support. Saying huge accounts get great support isn’t relevant for the overwhelming majority of AWS customers.


Not my experience at all - I was a small account. I was paying for support. They went way outside support scope and seemed to think nothing of it, they should tell customers to pound sand more often probably, but it seemed like they are willing to troubleshoot customer issues pretty thoroughly.

Did have some complaints about some things (classic flat network going away if you started with that had some bumps).


We don't have a huge account for our AWS. I'm not even sure of the tiers of support. The account manager we had assigned to us was great.


As a nobody with a few side projects on AWS I found support incredibly helpful, walking me through a long process of how to turn off and lock services when my account got hacked and a huge bill ran up.


I have a mid seven figure/year AWS spend and still have grandfathered "dev" level of support. Haven't gotten the (ridiculously expensive) "enterprise" level support because I'm just so happy with what we get with "dev" plan.


There is a mid level business tier.


We have a grandfathered in "dev" support that doesn't require a monthly percentage. Every plan otherwise has a 3-10% marginal premium.


What are all these declarations of "incredible" and so on? Where's the proof? Where's the bug tracker? I wrote literally that you cannot file a bug, and all answers here "oh, it's so great!" but still nothing about that bug tracker...


No. There's support for financial issues, or you can get support to tell you how to use AWS. If something is broken in AWS, you are on your own. Those people owe you nothing and won't do anything out of the goodness of their hearts because in most cases they don't even have a clue who within the organization they need to talk to in order to get a bug ticket created and processed.

On a personal level: as an individual developer, I have never even been able to reach anyone from AWS about their bugs, simply because the "solution architect" wouldn't even bother to talk to rank-and-file, and other than that the only other option is to vent into a text area in the AWS console, which I'm sure is just some JavaScript for show. You cannot file a bug, nor can you see what other bugs have been filed. Nor can you get any estimates on whether the bug was reproduced, nor anything about the plans for fixing the bug.


It's evident from the replies that many people don't share your experience. I've also had very effective support from AWS, from technical issues to spending optimizations. You seem to really care about whether a ticket exists to track your issue, which I don't really care about if I'm emailing a person and my issue is resolved.

Based on all the feedback you are receiving, it may be wise to reevaluate your points of contact with AWS and make sure you are speaking to the right person, and in the right way. Sorry to hear you've had bad luck until now.


A few years ago, when I worked at a tiny startup (<5), we had direct contact with AWS. Not sure if this has changed now.


Same experience here, they were even hooking us up with credits. Even when we were tiny they wanted to understand how we used AWS and were available to help with PoCs for products we weren't familiar with.

I interviewed many people in my life and every once in a while a candidate would be an AWS Solutions Architect. To my surprise it turns out that much of their time is spent helping customers implement their cloud workloads.

At $BigCorp when stuff wasn't working I regularly had AWS biting their nails trying to understand what was wrong until we got to a working solution.


I work at AWS now (ProServe). But I got my start with AWS at a 60 person company that had business support. The live support was amazing and I often used them as the “easy button”


And you filed a bug in a bug tracker accessible to all AWS customers and were able to track the progress on your ticket and got some estimates on the issue being fixed? -- No, you didn't, because there's no such thing. Stop this unnecessary shilling / wilful misinterpretation of what I wrote.


Hmm, I should use that more often


If you have it definitely use it. Working internal to Amazon we have enterprise support on every account and I can page the teams internally. I often have to remind our line engineers to open a live support ticket immediately upon issue (they forget and forget to parallelize). The enterprise support is just great for most things. Whenever we're having an urgent issue, as I said, I'm constantly reminding them to poke support. 80% of the time the support person has analyzed and figured out parts of our issues before I even find the internal contact to ask a question. They're also great at just debugging odd corners of the API. Deploy go weird? Can't figure out how to get a feature turned on in some of the more complicated areas? They're basically a pair programmer. I've handed them my CFN and they fork their own account and try setting up the infra. I do the same. Massively reduces the debug time. And people talk all the time about ChatGPT for this, well you're paying for it if you have support, and they have access to the internal logs/dashboards/KB, etc. so they're quite fast/good. Also super scrappy.

Internal support (eg contacting dynamo sr devs/product) is better for more architecture type stuff/design and real optimization. Support is great at diagnosing and debugging. If I was external I'd use aws SAs for the equivalent to the internal stuff. But I'm an sr dev so I deal more with the SA/Design side of the house.


How in the world do you get internal support to help you as an AWS employee? My understanding was that their KPIs were measured by helping external customers?

I work in ProServe (app dev) and I touch a lot of different services. Usually I can reach out to an SA or the service team's Slack channel. I’ve been able to get an API bug fixed relatively fast by posting a sample to the service team's Slack channel.

“fast” is relative. Even though the fix was fast, it takes forever to propagate non-urgent fixes throughout the entire AWS infrastructure.


By internal support I meant the product teams.

When I say fix non-urgent things, I'm patient. AWS has fixed many things for me esp if you look at the 1, 3, 5 year time horizon (I've been there 9). I've got lots of problems, having someone solve them helps, many things can be hobbled along for 1-2 years if you know it's going to be fixed and you can fix something else. Much of my job involves paying attention to roadmaps for aws teams internal and external and guiding my platforms work to best leverage that. For instance we knew R53 resolvers were coming, so we punted on service discovery as r53 resolvers solved it for us. Those engineers worked on another feature instead.


I built a CD service on AWS and I've never had issues with their support. Their automated systems sometimes cause unnecessary support incidents though, and waiting in the default support queue could easily cost a day of downtime for someone. Definitely gives me the impression that they try to steer people towards paying for support by forcing unnecessary interactions.


One more nonsense... Come on! Did you file a bug to AWS? -- No, you didn't. So, why are you answering me about your great experience? You are writing something irrelevant.


Sorry, can't agree with this. The one time a customer support team actually dialed me up to solve a technical issue was with AWS. Support for billing issues was done in roughly 15 minutes after initial report. Support with free credits was done after a perfunctory 10 minute phone call. For a hacked account, account access was restored in 30 minutes, from escalation of the complaint down to resolution.

Imagine getting that kind of support from any of the competition. GCP support was laughably inadequate (apparently they have separate support for Firebase and GCP, and GCP can't provide support for Firebase), while Azure support is exactly as you describe - great for enterprise, sucks for basic support. As for the innumerable AWS "startup" competitors like DigitalOcean, Fly.io or Render, support ranges from condescending to non-existent. In one of the above cases, support tickets were so backlogged, I was giving support hints to guys stuck on issues for days.


And you found a bug in AWS and managed to file a ticket? Right? And that ticket is visible to other customers, so they don't need to duplicate it if they also find it? And they assigned the ticket, triaged it, scheduled the fix, released the fix? -- No, of course not, because there's no such thing.

Read what you are replying to.


> AWS doesn't even have a way to report a bug

This is not true; we regularly open tickets with AWS for enhancements or bugs.


And where do you do this? Care to give a link to the bug tracker? I don't want to file mine because maybe I duplicate your issues?


You should probably talk to your AWS contact.


This sounds like a good thing. I work for a <1000 employee firm, and AFAIK we're spending in 6 figures monthly with AWS. I cannot even imagine what huge places spend there.

All it takes is bad support, severe price undercutting, or degrading service and we're off to another, or a split. By having people at least feign caring, giving support, and giving suggestions, they're locking us in and getting way more back in revenue than what it's costing them, at least from us.


> I work for a <1000 employee firm, and AFAIK we're spending in 6 figures monthly with AWS.

The argument for 6 figure cloud spend was always "it'll help you not need as many infrastructure-related dev ops engineers", right? Would you say that's true for your org?

Every org I've ever worked at has had a 6 figure monthly cloud bill, and a 6 figure monthly payroll bill for the teams working with cloud infrastructure.


It's kind of even worse: since public cloud became very popular, the profile of infrastructure programmers changed. A lot of the knowledge necessary to run your own infrastructure disappeared, being replaced by the knowledge of how to operate the service.

So, for example, things like PXE boot, which would be useful in your own datacenter but don't work (well) in clouds such as AWS, become a mystery. Obviously, this isn't limited to just PXE.

And since public cloud's goal is to provide service to as many customers as possible while having as few staff infra engineers as possible, the total expertise among the general public disappears. Having worked in the infra business for a while -- it's increasingly difficult to hire infra engineers.


> since public cloud became very popular, the profile of infrastructure programmers changed. A lot of the knowledge necessary to run your own infrastructure disappeared, being replaced by the knowledge of how to operate the service.

It was ever thus. Previously it was Cisco Solution Specialists and HP-UX Most Valuable Engineers and the like, and all the knowledge beyond the basics was completely tied to the vendor. In enterprise IT, it's still like that for Microsoft.


If I'm running on the cloud why do I care how PXE boot works? How is that helping me? Yes it's lost knowledge/skill areas but it's also not needed.

Similarly, I forgot how to configure HAProxy, and setting up Nginx would take me a few days to re-learn/optimize, but I don't need those skills anymore. Just like I don't need to know how to install Oracle/pgsql from scratch, set up RAID on my computer, compile the kernel, etc. to do my day job.

I'm also not debugging the assembly coming out of the compiler anymore looking for compiler bugs (the one time I've done that in my 25 year career, I just pinged one of the eng who worked heavily in that area and he filed a bug to Sony 10 minutes later and told me to just add some int ++ / -- and that shifted the compiler bug out of the way. Sony fixed it a month later).

Nor am I a wizard at adding/removing drivers on Win98 or setting up a Token Ring LAN.

Germane knowledge comes, and goes. My job isn't to know PXE boot or any other specific tech, it's to ship product...


I think you missed the point where they were saying most infra devs might not know how to do both on-prem and cloud, so it becomes less of a choice to use cloud if cloud skills are the only skills in your org.


That's a very good question. Well, maybe PXE specifically isn't such a great example, but still, this is a good question because the answer isn't obvious.

First, cloud vendors don't really invent the infrastructure from zero. They typically adapt existing solutions to their needs. And they aren't infallible. Every now and then there'll be a bug or an edge case they didn't handle. Alternatively, there will be weird restrictions that you, the customer, would have to go fish for in a substantial volume of quite opaque documentation in order to protect yourself from encountering those unfortunate edge cases.

I don't know if AWS internally uses PXE boot. But, suppose they did: this would mean that there'd have to be some specific details of the network configuration that enabled the use of PXE. Also, there could be some pathological cases of interaction between "special" (boot, bootloader) partitions and PXE boot. You wouldn't normally care in VM-based deployment about boot partition as some VM players can even offer to boot your kernel w/o you having a bootloader in the image, so, you may go for a long time w/o ever discovering what your bootloader is configured to do and how it may interact with PXE boot.

Suppose at some point AWS upgrades something in their PXE server / code it runs on clients... and it breaks your stuff. Because this would affect most likely all of your VMs, the effect may be devastating...

But... you didn't even know PXE boot existed. You had no idea that you had to look for some subsection X.Y.Z in the appendix to the user manual that warned you against using a particular configuration in your bootloader, which stayed dormant for a long time.

It's not hard to imagine how this might go financially...

Second, any solution purports to sell to many customers -- the more the better, that's how the solution becomes profitable. Naturally, the solution provider wants to find as homogeneous a group of customers as possible, so that with as little effort as possible they'd be able to cover most customers. But the reality tends to work against this desire by creating customers who aren't like each other. So, some generalization must happen, and it means that very specific, very individual desires of customers will likely not be covered by the service provider. Now, suppose you, the client, know how the underlying technology works, and thus are able to defeat the unwillingness of the provider to accommodate your very specific needs... or, you don't. Typically, this ends up being not a deal-breaker, but an inconvenience. Oftentimes it's the money you'll pay for the service you don't need, or the resource consumption that could've been avoided.

I cannot think of a good example with PXE, but here's something I crossed paths with recently at my kid's birthday party: I met a guy who works at Gitlab, and that reminded me about a grudge I had against Gitlab for some time. So, Gitlab is an example of a service that replaces the in-house IT / Ops who'd manage the company's repository in this case.

Gitlab, at least initially, gained popularity due to CI bundled into the service. This CI knows how to run jobs. These jobs produce logs. The logs can be obtained through the API, if necessary. But... if you do this while the job hasn't finished, there's no API call to enable paging of logs. So, if you want to display logs as they are being generated, you poll, and each subsequent response will get you 0..N characters, with N ever increasing. This is hugely wasteful and, in part, the reason why their logging configuration puts a hard limit on log size... Well, bottom line, their logging API is bad. But, you won't abandon the service because the logging API is bad. You'll suck it up, especially since they give you a bunch of free stuff... You'll just pay a bit extra, if you don't need the free stuff.
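To put a rough number on "hugely wasteful": if every poll returns the whole log so far, total transfer grows quadratically with the number of polls, whereas an offset-based API would send each byte once. A back-of-the-envelope sketch, assuming the log grows by a fixed amount between polls; the function names and the figures are made up for illustration, and nothing here calls the actual Gitlab API.

```python
def bytes_full_repoll(chunk_bytes: int, polls: int) -> int:
    """Total bytes transferred if poll k returns the entire log so far (k * chunk_bytes)."""
    return sum(k * chunk_bytes for k in range(1, polls + 1))


def bytes_with_offset(chunk_bytes: int, polls: int) -> int:
    """Total bytes transferred if each poll returns only the newly appended chunk."""
    return chunk_bytes * polls


if __name__ == "__main__":
    # Assumed workload: 10 kB of new log output per poll, 600 polls in total.
    chunk, polls = 10_000, 600
    full = bytes_full_repoll(chunk, polls)
    incremental = bytes_with_offset(chunk, polls)
    print(f"full re-poll: {full} bytes, with offsets: {incremental} bytes, "
          f"ratio: {full / incremental:.1f}x")
```

Under those assumptions a 6 MB log costs about 1.8 GB of transfer when re-polled in full, roughly 300 times what an offset-based fetch would need; the ratio works out to (polls + 1) / 2.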


With the PXE boot issue... Amazon has millions of servers and they roll changes out to hardware on a 5 year cycle or versions/skus of the compute. So anything hardware-wise will be detectable when your software doesn't work on the new instance type/sku/class. I've had this happen multiple times over 10 years, and always just canary out new instance types; in fact I set up our systems to make this simple. But you can easily try it out and hand the hardware back if it doesn't work. We reported the issues every time. I believe AWS scrapped one sku due to an issue we reported. They also pulled back skus from us because another customer realized they were broken in a way that would impact everyone eventually.

When it comes to software, Amazon again has millions of machines, regions, availability zones. AWS is very very slow and deliberate about how they roll out software. They do things by Sub AZ (shard, cell), AZ, Region... they can and do split on other axes as well. A PXE boot change would be caught by an increase in failed boots or underutilized instances (they didn't boot right) pretty quickly in a rollout, and rolled back.
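The general shape of that kind of staged rollout can be sketched roughly like this. It illustrates the idea only and is not AWS's actual tooling: the stage names, the 2% threshold and the `deploy`, `failed_boot_rate` and `rollback` callables are all assumptions.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    failed_boot_rate: Callable[[str], float],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.02,
) -> bool:
    """Deploy stage by stage; if a stage looks unhealthy, roll back everything so far.

    Returns True if every stage was deployed and stayed under the threshold.
    """
    done: list[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        # Hypothetical health signal: fraction of hosts that failed to boot.
        if failed_boot_rate(stage) > max_failure_rate:
            for s in reversed(done):  # undo in reverse order, newest first
                rollback(s)
            return False
    return True
```

Ordering the stages from smallest blast radius to largest (cell, then AZ, then region) is what keeps a bad change from reaching most hosts before the health signal trips.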

All of this is aided by many of AWS's customers treating servers like cows not pets, so whacking and restarting a small % of hosts isn't super detrimental and likely won't even be noticed. AWS can be extra cruel to internal teams actually. We have a few concepts internally that push users to be even more ephemeral than normal.

As for how people learn PXE boot? Experience or on the job training. I have no idea if AWS uses PXE boot anymore, I suspect they have a specialized setup, boot VMs directly, etc. There's hardware security with the hypervisors you don't get on normal machines. I PXE booted a Linux desktop the other day in the office to install Linux, so the corp side does use it.

I'm not really sure what the point you are getting at is tho. The cloud providers are doing things on a scale most people are not used to thinking of and the problems are often quite different as are the solutions. Are you rolling your own custom hardware with Foxconn (https://www.importgenius.com/suppliers/foxconn-aws)? Do you roll your own custom silicon (https://www.amazon.jobs/en/landing_pages/annapurna%20labs)? Would you be solving PXE boot issues if you had these capabilities?


What happened to all the engineers who knew these things before? Was this so long ago that they have all retired and died off?

Or is it that there are orders of magnitude more internet-based companies today, all competing for the same talent (which, indeed, is a limited skillset because these are not generally applicable skills). How many of these new companies exist _only_ because cloud infra is available now?


The industry grew a lot. I don't have the recent numbers but around the time AWS became a household name the growth in the industry was about 5% yearly. Programming was one of the fastest growing industries for a while (probably not true with the recent wave of layoffs, but it's probably still growing). But it grows very non-uniformly. There's a lot more growth in the Web sector. Web is one of the areas with more simplified infra, which can be catered for by the public cloud providers more easily. So, while there's growing demand, it's satisfied by services rather than employees' knowledge.

Another factor is promotion into management. Programmers have very little to look for in terms of professional growth. Those who "make a career" tend to go into management. It's so to the point that if you were in the industry for 20 years and you aren't a head of a department at least, you are seen as a loser. So, a lot of those who did know their trade a decade or two ago are ticking boxes in Excel today.

Yet another aspect is the nature of venture capital investment into startups which encourages very wasteful short-term solutions. So, while it may be a more financially sound strategy to do things on-prem, the investments are rationed in such a way that there's no money to build the infrastructure upfront, even if in the long run this would've been a better financial strategy.


Is there some institutional barrier to training new hires? It seems like the inevitable result if you need an increasingly niche skill. The situation doesn't seem like it'll get better on its own.


I work in an infra department, which used to be its own company before it was bought by a bigger one. Out of about 20 developers, one is younger than 30. Most are in their 40s and some are 50+.

I know that when I was hired, my job was advertised for over a year, if not longer. The pay may not be stellar, but it's still pretty decent...

In my previous position, I was also responsible for screening and hiring, so I also had some insight into what's available on the job market. My impression is that the younger generation simply doesn't even know this kind of job exists. Those who go into programming aspire to be a Web developer, or an ML developer, or a game developer... DevOps is one of those things people have heard about. But, this label was coopted by cloud some time ago. So, those aiming for DevOps role think they need to learn how to use public cloud API...

It's actually quite eye-opening to have access to the hiring process. One of the very surprising things to me was that newer generation of programmers tends to be confused about persistent / volatile storage. So, if you asked them to give a description of what a "file" is, more than once I've heard an answer like "a portion of memory". Which, while true in some sense, is a weird way to put it. But then I realized that people who are entering the field today may not have been exposed to PCs. Naively, if anyone asked me to show where the files in the computer live, I'd get them a picture of a hard drive. But people who only ever used laptops probably don't even know what a hard drive looks like.

Finally, it's also a chicken and egg problem. You cannot really learn about infra outside of a big org that has its own physical infra. Similar to how you'd never be exposed to configuring BGP if all your exposure to computers is your own PC. A popular method of learning today is to learn online. But, if you do that -- you are almost inevitably driven to the public cloud, because that's the way to provide you with all that virtual infrastructure online. You also cannot practice at home, since the hardware to practice on tends to be very expensive and not very useful for individual users.

Other fields get it easier because they have to select from somewhat prepared candidates. When it comes to infra, it's a lot more investment into teaching one.


Train by who? How do you start this process? If you never had the expertise, or more unfortunately had it but lost it, it's very expensive to gain it.


This seems to be a growing problem across the economy. Companies weren't willing to train for so long that now not only is there no one to hire who knows how to do jobs, there's no one left to train people to do them.


Considering the cost of the infra it would take to replicate some of these services, not to mention bandwidth, DC, and electricity costs, you're probably still saving money. You're absolutely saving on Network Ops and Data Center Ops, not to mention the huge investment in gear one has to make. A single server can cost over 100K. Network gear can cost way more. The fact that you don't have to make those investments is the allure. At a certain point, you might outgrow someone else's cloud and build your own, but that divide is a fairly large threshold to cross.


>>A single server can cost over 100K. Network gear can costs way more.

"Can" sure... likely to for most orgs no. Certainly not for someone running a 5 figure cloud bill...

as a reference, our standard compute server we use on prem is $12,000 including a 5 year support contract.

> At a certain point, you might outgrow someones else's cloud and build your own,

If you're buying $100K servers and $100K network devices, I'm pretty sure you are at that point.


if you're buying either, you're paying millions in salaries to technicians and sysadmins and DBAs^W^W^Wdevops and SREs.

i've been on both sides of the fence and it's a case of 'grass is always greener on the other side'. the truth is, running any sort of non-trivial infra is ducking expensive.


This is what always made me chuckle a bit. I have a half dozen friends who were sysadmins 20 years ago and today they're "DevOps".

Precisely zero have been put out of work.


Yep. It's not like either sysadmin or devops are harder than or easier than the other... they're just different.

Actually, seems like managing k8s is an order of time expenditure greater than managing an old-school F5 with a bunch of Unixy web servers behind it.


Yes, time and complexity.

Managing a switch and servers is a piece of cake compared to managing k8s, IMO.


Depends on how much abstraction you have, I have seen big companies where deploying code is basically like using Heroku. As an engineer responsible for a couple of services you don't need to know or care if this code is running on bare metal, Mesos, K8s and you care even less about the data center.

I come from this old world of managing switches and servers and today we definitely need a lot less people to run code in production. I used to work at a company with ~2000 machines in physical data centers before containerization, this required a huge infra team - I'm sure that today I could support the same workloads with half the team.


When I was a sysadmin it was definitely harder.

Devops gives the benefit of cows instead of pets, and a ton of reusable work. So I get why it's more 'valuable' from a remuneration basis.


Pets aren't that bad if your pets are a few elephants.


Yeah but how often are they elephants, and how often are they cattle that have just been neglected for 15 years?


>seems like managing k8s is an order of time expenditure greater than managing an old-school F5...

or Nagios...


Having worked half my career at places with their own data centers and self-run infra, and the other half with mostly cloud based solutions, I have a theory.

Perhaps we are designing far more complicated solutions now to leverage these cloud services, whereas having the constraints of a self operated data center and infrastructure necessitates more ingenuity to achieve similar results.

We used to do so much more with just a few pieces of infrastructure, like our RDBMS's, as one example. It was amazing to me how many scenarios we solved with just a couple of solid vertically scaled database servers with active-active failovers, Redis, an on-prem load balancer, and some webservers (later, self hosted containerization software). We used to design for as few infrastructure pieces as possible, now it seems like that is rarely a constraint people have in their minds anymore.


Amen, I'm becoming an old grumpy engineer on my team for constantly asking why we need yet another <insert cloud technology here>. I'm not against new technology but I am against not considering what we have and how it may already solve the problem without adding wider breadth to our operational surface area. And it's every single damn year now because now cloud providers string their own cloud primitives together to form new cloud services.


Could have just called an API but instead we fired an SNS event. Sigh.


How many times I've had this discussion. Let's publish a notification, and let's have the message receiver call some API. Why not just call the API from the place where you want to publish the message? Because we need this SNS message queue.


Probably because the API can be unreachable, time out, etc. With a message queue it can be redelivered without permanently dropping customer data or whatever with only a stack trace to remember it by.
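To make the redelivery argument concrete, here is a minimal in-memory sketch. It is not SNS or any real queue client: `FlakyApi`, `send_direct` and `drain_queue` are hypothetical names invented for illustration, and the "queue" is just a `deque` that requeues a message when the downstream call fails.

```python
from collections import deque


class FlakyApi:
    """Hypothetical stand-in for a downstream API that fails its first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def call(self, payload):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("downstream API unreachable")
        self.received.append(payload)


def send_direct(api, payload):
    """Call once; on failure the payload is lost unless the caller kept it."""
    try:
        api.call(payload)
        return True
    except TimeoutError:
        return False


def drain_queue(api, queue, max_attempts=5):
    """Requeue a message when delivery fails, up to max_attempts tries."""
    dead_letters = []
    while queue:
        payload, attempts = queue.popleft()
        try:
            api.call(payload)
        except TimeoutError:
            if attempts + 1 < max_attempts:
                queue.append((payload, attempts + 1))
            else:
                dead_letters.append(payload)
    return dead_letters


# Direct call: the API is down, so the message never arrives.
direct_api = FlakyApi(failures=2)
direct_ok = send_direct(direct_api, "order-1")

# Queued call: the same two failures are absorbed by redelivery.
queued_api = FlakyApi(failures=2)
pending = deque([("order-1", 0)])
dead = drain_queue(queued_api, pending)
```

Whether the extra moving part is worth it depends on how reliable the API is and whether the caller could simply retry on its own.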


That's naive without any context to claim. You have to know the source that triggers the code to publish the message, what the message is for, the fault tolerance and availability of the API we're calling before you can even begin to decide. Which you validated perfectly by giving a snarky "what about redundancy" answer to a complicated question.


> Perhaps we are designing far more complicated solutions now to leverage these cloud services, whereas having the constraints of a self operated data center and infrastructure necessitates more ingenuity to achieve similar results.

Nothing about "ingenuity", just plainly having some friction in implementation makes for simpler designs.

If you have zero cost (aside from per request pricing but that's not your problem right now, that's management) to add a message queue to talk between components, now that's a great reason to try that message queue or event sourcing architecture you've read about.

And it works so "elegantly", just throw more stuff on the queue instead of having more localized communications. We don't worry about scaling, the cloud worries about it (now the bill for that queue starts to ramp up, but that's just a fraction of a dev salary, we saved like 2 weeks of coding thanks to that! Except that fraction adds up every month...).

Repeat for next 10 cloud APIs and you're paying at every move, even for stuff like "having a machine behind NAT". And if something doesn't work can't debug any of it.

Meanwhile if adding a bunch of queue servers would take few days for ops to sort monitoring and backups on it, eh, we don't really need it, some pubsub on redis or postgresql we already have can handle stuff that needs it, and rest can just stay in DB. This and that can just talk directly as they don't really need to share anything else on queue, we just used queue to not fuck with security rules every time service needed to talk to additional service.


it is the classic find a problem to use our solution, or XY problem

As an example I have seen many times people attempt to find a reason to use k8s because the industry says they should instead of looking at what they need to do and then determining if k8s is the best for that application


Our reason was pretty much "clients want to use it". One migrated to it for no good reason whatsoever aside from senior dev (that also owned part of the company) wanted to play with new toys. Other one halfway decided that their admins don't really want to start k8s cluster and just told us to deploy resulting app (which REALLY didn't need k8s anyway) on docker.


Maybe they’re looking for an excuse to gain k8s experience to bolster their resume? If most startups fail, might as well gain some skills out of the current one? Perhaps it doesn’t benefit the startup though, inflating complexity, infra spend, and slowing productivity.


And as far as problems go, "we're so successful that it makes financial sense to build our own on-prem cloud" is a pretty good problem to have.


I always figured it was the other way around. When you're small it's pretty easy to get by with a stupidly simple solution but as you grow you end up needing to spend much more to build something scalable and at that point, using the cloud makes sense. The biggest success that cloud providers have had is convincing users that they need to spend $100k and that a much simpler $5k solution that's built using off the shelf components just won't cut it.


I see the cloud mostly for startup-ish companies hoping to grow rapidly but which want to avoid large upfront expenses to be ready for said growth.

A stable company where growth as a percentage isn't likely to be significant can run things cheaper on their own in most cases. At least if you consider the cost of the inevitable departure from the cloud provider either to switch another or to go on-prem. And if you aren't willing to make that exit, you can guarantee your cloud provider won't stop cranking up the fees until the threat of you leaving surfaces.


I think this is a pretty key point. If a business is going through any kind of rapid change, cloud providers offer a lot of off-the-shelf help for that, be it ability to scale, hosted infrastructure, or PoPs in new geographies. If the company is relatively static with easily predictable future requirements, you can get a lot more bang-for-your-buck by handling things on your own and developing your own in-house expertise.


There is also a third approach that is the best, imo, if you have a predictable base load with occasional surges: hybrid cloud

You basically run the base load in your own data center and the surges go to the cloud. My university is evaluating this because sometimes you have multiple labs that need a lot of compute resources at the same time and local compute cluster has finite capacity.


Time to market and avoiding NRE is great. Margin doesn't matter in the beginning.

But hopefully you don't get trapped in the cloud and can claw the margin back.


The most painful is having to run multiple data centers for HA. Double or triple the price right there.


> The most painful is having to run multiple data centers for HA. Double or triple the price right there.

Ok, let's make one thing patently clear: IT'S THE SAME IN THE CLOUD

All the cloud vendors will tell you that you need to have stuff replicated in multiple "Availability Zones" or "Regions".

And yup, the nickel-and-diming nature of the cloud means that's going to cost you double or triple.


It's not though. With your own stuff you have at least one DC sitting idle, with all that private gear doing nothing. Doesn't matter if you don't use a single byte of bandwidth. With AWS at least some of that is not there.


If you're set up for HA you're paying for the idle hardware either way, and if you save on electricity that might benefit the DC option but not the cloud option. Overall not much difference there.

Bandwidth is the one thing where the cloud clearly wins wrt idle servers... except that DC bandwidth is a hundred times cheaper than AWS bandwidth, so you should prefer buying 133% or 150% or 200% DC bandwidth by a mile.
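One way to read the 133% / 150% / 200% figures is as N+1 headroom: with 4, 3 or 2 sites, each sized so the survivors can carry the full load after one site is lost. That reading is an interpretation, not something stated above, and the 100x price ratio below is simply the claim from this comment, not a measured number. A rough sketch:

```python
from fractions import Fraction

# Assumption taken from the comment above, not a measured figure:
# cloud bandwidth costs about 100x what data-center bandwidth costs.
ASSUMED_CLOUD_PRICE_RATIO = 100


def provisioned_fraction(sites, tolerated_failures=1):
    """Total capacity to buy, as a fraction of peak load, so that the
    surviving sites can still carry 100% of the load."""
    survivors = sites - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving site")
    return Fraction(sites, survivors)


overheads = {n: provisioned_fraction(n) for n in (2, 3, 4)}

for sites, overhead in overheads.items():
    # Cost of the over-provisioned DC bandwidth relative to buying
    # exactly 100% of peak in the cloud, under the assumed price ratio.
    relative = float(overhead) / ASSUMED_CLOUD_PRICE_RATIO
    print(f"{sites} sites: buy {float(overhead):.0%} of peak, "
          f"about {relative:.1%} of the cloud price for 100% of peak")
```

Under those assumptions, even the worst case (two sites, 200%) stays a small fraction of the cloud bandwidth bill, which is the point being made.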


Whether you are paying for HA depends on your Recovery Time Objective (RTO). You can have a bunch of suspended EC2 instances and non EC2 resources where you only pay per use in another region.

You can redirect traffic to another region and have autoscaling spin up EC2 instances, etc.


Sure, if you can wait for it to load from unallocated resources (and risk failure) then it's a very different scenario.

But, very notably, you can have a suspended cloud backup even if your main servers aren't cloud. And the added complexity for datacenter-to-cloud HA doesn't have to be significantly higher than the cloud-to-cloud version.


Entirely depends on use case. If you "just" need a lot - a lot of storage, bandwidth, CPU power - going on prem is way cheaper when you get up to "few racks of servers".

If you complicated your architecture enough - and the cloud makes it oh so easy to make a Rube Goldberg architecture - keeping many different services running or even developing in-house can take a lot.

And it's not like cloud costs you zero in ops work either, just need different set of skills.

But it is not like on-prem stagnated - there is plenty automation in that space too. Our team of 3 manages 7 racks of servers and few dozen projects on them (anything from "very legacy" to 30+ node k8s cluster and ceph storage with it) and the hardware management still isn't majority of our work


A single server which can probably serve your complete customer base could / might cost less than 5k


As I recall, StackOverflow runs (ran?) on 6 large but not massive servers.


They also used the dotnet stack (Windows Server, IIS, MSSQL, dotnet), and optimized the heck out of everything. They're not the typical use case.

(I'm not saying dotnet allowed them to get by on N low digit servers. I'm saying those folks are atypical)


So assuming your code is 5 times worse that still fits within one rack

> I'm saying those folks are atypical)

Calling "write not shit code in language that's not dog slow" "atypical" is a sad state of our industry.

Also in many cases you can get a lot out of caching if you do it smartly.


We did something very similar with a Java stack without even trying really. Competitors that used things like Ruby and went all in on distributed messes had hundreds of servers but we had about 15. It does require you to be aware of performance, but I wouldn't call it difficult or particularly time consuming.


What are you saying by pointing out that they use the dotnet stack?

It’s interesting trivia to be sure, but I’m wondering if you were making a point with that


Have you seen the performance of standard ruby and python web frameworks in comparison? It's a massive difference


I am gonna go out on a limb and say that given that they're talking about replication they mean server rack which definitely is not $100k/mo but can pretty easily be $100k up-front.


Sure, but 100k buys you a lot of iron and silicon.


Add in redundancy and that number quadruples bc you need to manage shared state and 2-3x the hardware


> A single server can cost over 100K. Network gear can costs way more. The fact that you don't have to make those investments is the allure.

IF you're pushing 400+ Gbit, sure. Most won't.


A six figure payroll bill related to teams working with cloud infrastructure? So, a low-single-digit number of engineers?


The real cost in most places is in opportunity cost and lead time. e.g. you want to do a thing but you can't because you don't have the physical resources available. That can be very frustrating for people trying out things that will produce novel revenue in the org. Reducing the cost of experimentation is quite valuable in a larger organization.

For something like our HFT firm, we see high sustained utilization so we have an on-prem cluster of nodes that we run. They're faster (more powerful CPUs), have more RAM, and you can run fiber optic interconnect that's, in practice, faster than on AWS. We still use AWS extensively, though.


Except that for the same cost in cloud you can't just have double the resources in dedicated servers, but like 3x to 10x depending. And if you do colo, even more.

I get that if you have management that won't keep even 10% spare capacity online you can't work like this, but they will also "tightly control cloud spending", won't they? Not sure.

Cloud only really makes sense at really small scale.


Everyone's use cases are different. But I think I'm making the right trade-offs here overall (I am management in this world).


You can use fiber for hft through AWS as well, but you lose a lot of the benefits of cloud when you have a sustained workload that doesn’t benefit from being fault tolerant because the system would be down anyway.


> Every org I've ever worked at has had 6 figure monthly cloud bill, and the 6 figure monthly payroll bill related to teams working with cloud infrastructure

There was a point not too long ago when I managed a 6 engineer team responsible for 7 figures of monthly spend. Still seems outlandish to me despite the ops burden being pretty low once we had things sorted.


GitHub is paying $3-4 million per month. There are EC2 images dating back to 2011 wheeee


GitHub is not on Azure? TIL


If so, that's quite the case study on the difficulty of cloud migration!


Any chance to contact you? Would love to ask a few questions.


> All it takes is [bad things] and we're off to another

> By [avoiding those bad things] they're locking us in

Providing a good product at a good price is "locking in" ?


Hmm, don't think that applies all the time. Hetzner is way cheaper than AWS, and in most cases, I don't know why companies won't use it instead of AWS. The only reason people use AWS, from what I know, is that it helps them with the next job, because everyone is using AWS. Kinda feels like MS and Oracle in the '90s.


Hetzner's dedicated servers have a limit of 10 firewall rules for their (hardware based) equivalent of AWS Security Groups.

And some of those rules already have mandatory entries, so you'll likely only have around 7-8 actually usable.

"Just use the firewall in your OS" is a workaround, but doesn't fully cover the same scenarios. That's why AWS has Security Groups after all.

This limitation means we have to use smaller sized servers (64/128GB ram), rather than bigger ones which would otherwise be more cost effective.

It's a real pita for us, and Hetzner have shown no interest in ever addressing the problem. :(

That being said, it's all still a lot cheaper than AWS, as Hetzner doesn't charge for bandwidth with their dedicated servers. :)


Hetzner is way cheaper. But it doesn't have a database service or an S3 analog. Both of those are needed if you want to avoid spending a lot of money on operations staff.


There are more factors than "nobody got fired for buying IBM" regarding AWS/Azure/GCE:

- the Big Three have data centers all over the world. No other provider can match that. (Obviously: if 99% of your customers come from the EU, Hetzner, OVH and friends will be cheaper)

- there is a ton of prior experience in managing and provisioning infrastructure on the Big Three. You got Terraform providers covering every oh so tiny resource they have, and for literally every problem you got dozens upon dozens of stackoverflow/serverfault posts.

- the Big Three have allll sorts of managed services. No matter if you need cheap blob hosting, virtual servers, bare metal servers without paying setup fees, AI training servers, VPNs, global load balancing/caching, data storage/archival, databases, "serverless" hosting, email, SMS, IoT communication, Active Directory and other identity/authentication/authorization solutions, business data analytics, whatever - it's all a one-stop-shop. No dozens of vendors, bills, payment schedules, GDPR processing agreements, budget issues - one vendor, one bureaucracy, everything you need.

It's really bad that the EU never woke up to the rise of the Big Three and say, for example, financed the development of OpenStack to the tune that it could be used by domestic providers, because now the Big Three have all but eliminated the smaller competition.


If all you want is bare metal on the most interconnected backbone, try Equinix https://deploy.equinix.com/metal/.

We're not as cheap as Hetzner or OVH, but we're just as global as public cloud and will match or beat the latency.


Heh fascinating. I thought you guys were purely colo hosters, my employer has a ton of rackspace at one of your DCs.


Well sure. We're happy with the service and support, so keep deploying more and more. It's not locking in like a gym membership sure, more of a trust relationship. And the more we deploy and use, the harder it would be to jump ship.


Egress costs alone make leaving pretty much a non-starter once you're entrenched.

Leaving really involves running two systems in parallel for a bit and gradually doing a changeover - blue-green in production between two different clouds. Which is not cheap/free either, you are actually going to increase your costs significantly as you leave.


Appropriately sneaky: https://aws.amazon.com/snowmobile/ appears[1] to allow export of petabytes of data but then the FAQ reneges “Snowmobile does not support data export”[2].

[1] tagline: “Migrate or transport exabyte-scale datasets into and out of AWS”.

[2] “Q: Can I export data from AWS with Snowmobile?” A: No. in https://aws.amazon.com/snowmobile/faqs/#Using_Snowmobile


The answer, for the curious, is to use Snowball Edge.


Not an answer for the Exabytes they appear to be falsely advertising they can export in the tagline. A mediocre answer for Petabytes. I agree Snowball Edge is one answer for Terabytes: “HDD storage capacity: 80 TB” - https://docs.aws.amazon.com/snowball/latest/developer-guide/...


It seems your infra spend is fairly contained compared just to the human cost of your organization.


<incoherent-stuff/>


Total spend = 1k * 100k = 100M

AWS spend = 500k * 12 = 6M

AWS spend / Total spend = 6%

Where the hell does 0.1% come from?


Yeah, that was broken. Please ignore. (Sorry!)


The original promise of the cloud was that it was cheaper than on prem. For many companies this has turned out not to be the case. Having worked on a cost optimization product for the past year (https://vantage.sh) my belief is that the cloud _is_ cheaper than on prem but it is difficult to configure it to be so.

In particular it is difficult and stressful to understand all the knobs to turn within each cloud provider. It's a very AWS thing to put a lot of effort into lowering their customers' bills and I do think it makes for longer lasting business relationships.


> my belief is that the cloud _is_ cheaper than on prem but it is difficult to configure it to be so.

It takes discrete effort in the core planning of your architecture. You cannot look at AWS resources (or any other cloud provider) as an all-you-can-eat buffet. You have to look at your requirements, look at the viable options that may support your requirements and take into consideration cost associated.

Without looking at cost during design your company/service/team will have a bad time at some point. There are also some really specific things you can do to lower costs but require knowledge and understanding of a myriad of technologies.


you can and should look at them as an "as little as you can eat buffet" option. the cost transparency is a huge part of the value of cloud. even if it cost the same as on prem, the transparency is huge, but it's even more valuable if you make smart choices with that transparency.

of course often folks just ignore costs until all of a sudden it's a huge pain point.


Our team typically expects some cost consideration given in design documents. This way we know that some attention was given to it and it's a good starting place for discussing costs during design review.


I genuinely struggle to imagine (realistic) scenarios where cloud is cheaper but I am definitely interested in where that would be the case.


I have a realistic scenario:

I write event (think food festival) software. 9-10 months out of the year there is almost zero traffic, 1-2 months have mild traffic, ~2 weeks are higher traffic, and then 1-5 days (during the event) are very busy.

I'd be hard pressed to find anything other than my current stack (Lambda/DynamoDB/S3/Route53/APIGateway) that costs as little. In the off-months my costs are ~$5 if that and in the month of the event that might get as high as $15-30. I cannot imagine being able to host it locally/colocated (for the month of the event) for that cheap and that doesn't even touch on the hardware cost (I'm thinking just electricity and co-locating costs).

Yes, I'm a small fish and this is just my side business but for me the cloud is way cheaper. Anything with very spiky/uncertain load can be good candidate for the "cloud" but a big part (IMHO) of making the cloud work is using the managed services and using them smartly. If all you do is spin up EC2 instances then no, the cloud is probably not going to be cheaper in the long run.


I think you're almost always gonna pay far more for cloud compute than on-prem. It's pretty standard for companies to have a "peak" load of roughly 2x their "minimum" load, so while with on-prem you maybe have to over-provision by ~1.5-2x, that's greatly outweighed by how much more expensive cloud compute is vs. on-prem compute. The only exception is companies with extremely spiky demand - e.g. if your peaks are more like 10x or 100x your troughs, then yeah, cloud is probably cheaper from a pure compute POV.
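The compute-only comparison can be sketched as a toy model. Everything here is an assumption for illustration: on-prem is sized for peak and paid for all the time, cloud is paid only for what is used but at a 4x unit-price premium (a made-up number), and scaling is assumed to be perfectly elastic. Salaries and managed services are ignored.

```python
def onprem_cost(loads, headroom=1.0, unit_price=1.0):
    """On-prem is sized for the peak (times headroom) and paid for every period."""
    return max(loads) * headroom * unit_price


def cloud_cost(loads, premium, unit_price=1.0):
    """Cloud pays per period for the capacity actually used, at a premium."""
    average_load = sum(loads) / len(loads)
    return premium * unit_price * average_load


ASSUMED_PREMIUM = 4  # hypothetical cloud-vs-on-prem unit price ratio

# "Typical" company: peak is about 2x the minimum.
steady = [1, 2]
# Very spiky: one period in a hundred runs at 100x the trough.
spiky = [1] * 99 + [100]

steady_onprem = onprem_cost(steady)
steady_cloud = cloud_cost(steady, ASSUMED_PREMIUM)
spiky_onprem = onprem_cost(spiky)
spiky_cloud = cloud_cost(spiky, ASSUMED_PREMIUM)

print(f"steady: on-prem {steady_onprem:.2f} vs cloud {steady_cloud:.2f}")
print(f"spiky:  on-prem {spiky_onprem:.2f} vs cloud {spiky_cloud:.2f}")
```

In this model the cloud only wins when the peak-to-average ratio exceeds the price premium, which is the "extremely spiky demand" exception in a nutshell.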

However, unless you use HUGE amounts of compute, cloud is probably the right choice, because you probably save a tonne on salaries by going the cloud route. It's way easier to properly operate a managed DB than your own (especially in terms of no-downtime upgrades, no-downtime scaling, backup/recovery, etc.), same goes for managed file storage (S3 type things), same goes for managed pub/sub, same goes for managed K8s/whatever, same goes for managed load balancers, etc. For a lot of startups, total cloud spend is equivalent to just a few fulltime Engineer salaries, and they also use a tonne of cloud services, getting that all running well without AWS (or Google Cloud, Azure, whatever) would take way more than a few fulltime Engineer salaries.

FWIW, the last company I worked at (~1000 people), as our AWS bill was getting into the millions annually, we tried migrating off AWS, onto a colocation setup. I didn't work on that project myself, but my understanding is they spent a bunch of engineering effort on it, were still nowhere remotely close to being able to truly move off AWS, ultimately canned the project and moved all the colo stuff back to AWS. I think a lot of engineers underestimate how much effort it is to provide the kind of cloud services Amazon/Google/Microsoft provide, at comparable levels of reliability and ease of use. It's A LOT of effort.


With fixed hardware, you're always planning for max workloads.

The systems are scaled up for the "Black friday" sales event.

With cloud, you will plan for the steady state and just boost it up when you need to.

The anti-pattern is usually that when on-prem hits 80% load, some engineer is usually tapped to weed out all the slow code to push past procurement delays.

While in cloud, people just throw more hardware at it and ignore low hanging performance problems which are costing them money (& unlike the on-prem, fixing it will immediately reflect in the budget the next day).


At low scale it can definitely be true. My startup spends $0 on ops salaries because it's possible for an AWS-proficient backend dev to not need more than an hour or two a week to maintain the entire production infrastructure. Our AWS bill costs us less than it would take to hire a person to run on-prem physical infrastructure.

Not sure if you consider small startups to be realistic scenarios, though, so perhaps this doesn't count.


Here's one: F1TV

Their peak demand is insane, especially at race start, and is only sustained for a few hours for ~20 weekends a year.

AWS have already built a live streaming stack for their other customers with DRM and support for a wide variety of platforms. And there are other features like live rewind, restart, live to VOD etc.

It is also relatively straightforward to serve new countries, you don't have to build data centres all over the world, just stand up your infra in a new region.


Four jobs ago I worked on a search engine for a large national sales company. They ran TV ads sparingly. When they ran a TV ad their traffic would 10x immediately, then see a mild, smooth increase over the next month or so. Would it have made sense for them to operate on prem machines that could handle their peak capacity, when they only needed their peak for hours per week?


When you need a lot of computing power for a very short time.


Any use case that has

A) a user base with usage that varies significantly over time

and

B) an architecture that can actually scale significantly up and down during the same time period

So if you have lots of web back ends and business logic and general workers for jobs that can just be shut down when there's no capacity needed, there's a ton of savings available in the cloud.

If you have a big monolithic architecture or some other organization where everything is just on all the time, the cloud makes less sense as a cost savings.

I once wrote an internal chat bot which was used less than 2 seconds per day, but was actually supremely useful as it really simplified some process or another and took a fraction of a second to run each time. The actual cost to run this in the cloud was approximately 1 cent per year. If I didn't have this cloud capability it would be hundreds of dollars per year to host or thousands of capital in machines and maintenance to host myself.
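A back-of-the-envelope check of the "about 1 cent per year" figure, as a rough sketch. The per-GB-second price and the 1 GB memory size are assumptions chosen for illustration (the price is in the ballpark of typical pay-per-use function pricing), and per-request fees and any free tier are ignored.

```python
def yearly_compute_cost(seconds_per_day, memory_gb, price_per_gb_second):
    """Yearly cost of a function billed per GB-second of execution time."""
    gb_seconds_per_year = seconds_per_day * 365 * memory_gb
    return gb_seconds_per_year * price_per_gb_second


ASSUMED_PRICE_PER_GB_SECOND = 0.0000167  # USD, assumption for illustration
ASSUMED_MEMORY_GB = 1.0                  # assumption; the comment gives no size

cost = yearly_compute_cost(
    seconds_per_day=2,  # upper bound from the comment: "less than 2 seconds per day"
    memory_gb=ASSUMED_MEMORY_GB,
    price_per_gb_second=ASSUMED_PRICE_PER_GB_SECOND,
)
print(f"${cost:.4f} per year")
```

With these inputs the result lands at roughly a cent a year, so the figure in the comment is plausible for a workload that runs for a couple of seconds a day.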


I do agree that it’s a shrewd move to build relationships but I also don’t think that it’s clear cut that companies are spending more to be in the cloud versus the cloud costs being much easier to see compared to on prem TCO, especially when it comes to staff time or the costs projects take on when they’re limited to the services their IT department is capable of building and operating.

A huge confound here is that in any environment these kinds of metrics are proxies for a lot of cultural health. It’s easy to find places hemorrhaging money on AWS without having an easy way to know whether the same managers would have wasted comparable sums internally, because e.g. the root cause was letting Accenture send a bunch of 25 year olds to design whatever adds the most expensive terms to their resumes. So while it’s AWS today, a decade ago it was a $10M Hadoop cluster holding one MacBook Pro’s worth of data.


For my tiny startup, the cloud is two to three orders of magnitude cheaper. We could not exist were it not for the cloud, AWS in particular.

Last week I had a call with the AWS support team helping me get some costs under control with improved architecture and optimizing how I pay for compute. Based on our chat, I believe I can reduce costs by a third, which for us means multiple more months of runway, critical time to find P/M fit.

At my old job, cloud costs were out of control, mostly because there was a huge effort to “lift and shift” the existing systems, rather than redesign them with the cloud in mind.


> two to three orders of magnitude cheaper.

Bold assertion. Care to shed more light? Cheaper in terms of your labor costs?


Yeah, labor. The team needed to manage the variety of infrastructure required to provide the same functionality and availability as the various AWS services we use would have been well beyond prohibitively expensive.


I've worked on cost optimization initiatives in several teams, and how they get themselves into a money pit has different causes (e.g. lack of skills in the team, speed to market, etc.). From experience, most teams that haven't done cost optimization have huge opportunities for savings (in one of my teams we reduced cost by 90% while handling 3x our traffic, and we still had a few optimization opportunities that we decided not to pursue).

On prem has less flexibility on cost reduction, and if cost has already been incurred, at best they can reduce their bill moving forward.


If you don't mind me asking, what did you do to reduce cost by 90%?


> The original promise of the cloud was that it was cheaper than on prem.

As soon as it became clear that this was not the case, the selling point swivelled smoothly to "enablement" and "agility" without missing a beat.


If you look at TCO, and include opportunity cost for those on-prem services where you might be caught behind the supply chain eight ball for weeks or months, then I think you're right. But you have to look at all sources of costs on both sides.

And it's certainly possible to configure any cloud provider in a way where it is much more expensive than a corresponding on-prem solution, and it's also possible to do the reverse.

Where things get interesting is where you compare all sources of costs on both sides, and both sides are as cost optimized as you can feasibly make them. Then you are a rare unicorn indeed.


Cloud is a service that moves CapEx to OpEx. Whatever cost savings it may provide are a side effect of this move. If you moved to cloud and do not see any savings, that means you had sized and planned your capacity well. But there are a lot of businesses that cannot plan ahead for various reasons, and cloud provides value to them.


Vantage.sh has SSO tax, et tu brute?


it's an interesting evolution since in the industry the perverse incentive is to not invest in efficiency since it would blow up revenues charged per volume.

snowflake is going through such a crisis, since they have a bill for volume model.


So long as software engineer time is not respected or tracked (compared to other engineering disciplines), in times of economic downturn it will always be popular to shit on cloud.

People stop caring about the elasticity, about the operational overhead to their people, additional 9's of availability, even about feature delivery velocity.

Tech companies turn into accountants looking at X $/Month in the cloud vs Y $/month from "buying a couple of servers".

Then you have "luminaries" like DHH and Elon blogging about how much leaner and better their services are as a result (never mind that twitter is now a buggy mess where everything is eventually consistent and breaks half the time)


Not just their "clouds" but their "cloud spend".

From the article:

> Amazon Web Services sales and support teams are currently “spending much of their time helping customers optimize their AWS spend so they can better weather this uncertain economy.”


No doubt by pushing the clients to use vendor-locky-in serverless tech stacks that are inappropriate for the workloads.


I was in Google Cloud's partner org for >7 years, managing a team of solution architects pointed at our most strategic partners (Accenture, Deloitte, Infosys, HCL, TCS, etc). I have been saying for nearly the entirety of the time that the way things will pan out for hyperscalers is going to roughly look like this:

AWS = the bazaar of public clouds. A service for nearly everything, but WYSIWYG and quality & consistency will be all over the place, just like amazon.com has become.

Azure = will ultimately own the enterprise. Microsoft is the new IBM and their mature GTM and Professional Services organizations will carry increasing value as cloud services become mostly commoditized. Microsoft has multiple large revenue streams to fund disruptive acquisitions and these will bear fruit through good management. Satya is proving to be an exceptional CEO.

GCP = an also-ran on compute/network/storage, but potentially differentiated on data, analytics & AI if leadership can organize the troops and keep focus on things that matter. Still very rough on the GTM side, and late to the party, making enterprise sales that much harder. Has some unique strengths, but can't rest on its laurels. Execution and focus are key.

Oracle = Making a killing selling an easy-button-to-cloud to existing Oracle on-prem customers, and frankly, this is plenty to keep things moving forward for a while.

IBM = Lots of hopes & dreams associated with RedHat acquisition and GBS/GTS split. It's still a bit early to tell how things will pan out, but IBM isn't playing the same sport as the rest, and has a lot more in common with Oracle than the big three.

My prediction for the next couple of years is that AWS continues to lose market share due to QoS and ultimately being spread too thin (and alienating partners), and all the rest capture it. MSFT is the biggest winner, Google will eventually find a sweet spot (it's already in the "too big to fail" category, and will achieve profitability this year), and ORCL/IBM will see increasing success in their niches but largely avoid direct competition/comparison with the others.


> AWS = the bazaar of public clouds. A service for nearly everything, but WYSIWYG and quality & consistency will be all over the place

> My prediction for the next couple of years is that AWS continues to lose market share due to QoS and ultimately being spread too thin (and alienating partners), and all the rest capture it.

Need data points.

I've used AWS, GCP, and Azure. AWS is still the only cloud I'd use for business critical infrastructure. There hasn't been a drop off in the quality. I'll take AWS compute (EC2, ECS) over anything at Azure and GCP. Warts and all.


> Azure = will ultimately own the enterprise. Microsoft is the new IBM and their mature GTM and Professional Services organizations will carry increasing value as cloud services become mostly commoditized. Microsoft has multiple large revenue streams to fund disruptive acquisitions and these will bear fruit through good management. Satya is proving to be an exceptional CEO.

Given AWS is still growing at 40%+ a year on their huge base, vs Azure who have struggled to hit 30% on a much smaller base... I'd expect the opposite to happen here. Microsoft will end up as the home for companies who 'buy' tech, and Amazon for the companies who 'build' tech.


Not only that, Azure counts everything as cloud $, including the Windows XP wallpaper.


People underestimate AWS, but as long as their S3, EC2, SQS, and Lambda keep kicking butt, it'd be really hard to displace them.

All roads ultimately lead to storage and compute, which I think AWS does a really good job in.


Are these all kicking butt? S3 is great, and I don't know much about SQS...

But EC2? Configurations are static, cost optimisation is hard, the spot market is a pain, reserved instances aren't fun to trade-off. Compared to GCP where you get sustained usage discounts which are much simpler to reason about, and you can spec machines in much more dynamic ways.

And Lambda? It's just not a great environment to build applications for. Frameworks are necessary to tame complexity, API gateway adds a ton more complexity, dealing with any state is difficult. Lambda doesn't seem any better than any of the other FaaS offerings. Fargate seems like a better option for most people who want to use serverless, but again that's not unique.


AWS compute savings plans are easy to reason about, though, and cover things like Lambda and Fargate, too. GCP’s sustained usage discounts are nice but they don’t save more than AWS’ savings offer and in the last 6 months that’s been a lot lower than GCP’s cost increases on other services.


I don't think they are easy to reason about. They're non-trivial to explain, and to figure out how to make best use of them requires quite a lot of modelling, usage projection, etc, to figure out how many of each type to buy.

The lowest priced options also require pre-purchasing, which ties up cash. They're rarely worth it for a business with typical cash flow.

From a cash flow perspective "it just gets cheaper after 3 months" is very straightforward, requires no usage projection, and not a lot of modelling to figure out eventual prices. The fact that it's not opt-in also means that more businesses will get the benefits; many don't use reserved instances because they don't have the capacity to figure out how best to use them, or worry about giving up the flexibility.


> The lowest priced options also require pre-purchasing, which ties up cash. They're rarely worth it for a business with typical cash flow.

This is probably where our experiences differ: I’ve dealt with this at large organizations where cash flow isn’t tight and the accountants have never had trouble working with the prepayment concept since they do that for tons of things.


That's fair, but cash flow is a major focus for many businesses, and with AWS it's still _hard_ to figure out the impacts of all these things, especially if you're a growing startup.


As for Lambda, if ur app gets less than a million requests it’s basically free, which I love. API gateway lets me sleep at night because of all its security features.

EC2 is kind of crappy UX wise but spot instances are cheap af


1. At which point does it become cheaper for AWS to offer discounts to customers rather than optimizing their spend?

2. At which point does it become cheaper for AWS to just simplify their pricing structure?


Optimizing spend for a customer is also optimizing capacity, which is indeed finite. By optimizing spend with right-sizing and such, they lose a very small amount upfront, but can increase customer density, which is a much more sustainable data point for modeling capacity growth. More customers on the platform is more opportunity for larger growth at a sustainable pace that they can scale capacity for.


1. "optimizing the spend" is probably overengineering some insane Rube Goldberg machine which locks you in more and commits in engineering resources which could be spent thinking about moving to a different cloud / onprem

2. See 1. WAI


Personally, this is what I expect of them, specifically for smaller orgs that don't have CISO, SRE or other ops folks needed.

Now, I'm not saying this should be the expected behavior for everyone or every situation, but "aws staff spends majority of time doing what aws staff is hired and skilled to do"

Seems pretty "cost of doing business" to me


I think that this consumer oriented facet of Amazon is impressive. No other provider will dedicate this time to clients.


I can make a similar argument regarding Jehova's witnesses.


Guys! We have another ~~Linux~~ Zealot over here!


Oh, I have had good experiences with German providers Manitu and Corpex. Very good service, although the latter almost exclusively deals in managed hosting.


A few weeks back I helped someone move from EC2/ELB/RDS to Lightsail VM/LB/Database. The same app and stuff, but the price is now fixed[0]. So many apps can use something like Lightsail and cut costs. I am surprised AWS doesn't promote Lightsail. They also have a weird OS selection and still have CentOS 8, which is no longer supported. They treat Lightsail as a second-class citizen.

[0] https://aws.amazon.com/lightsail/pricing/


I think DigitalOcean is a far better Lightsail implementation.

Both products trade predictable performance and support for predictable, low, pricing.

However DigitalOcean is a much slicker product. It's both more capable due to the wider range of services, and simpler due to not having all the AWS cruft. It's still not really appropriate for big services (we had outages due to poor DO network management, poor support), but great for what it is.


I won't use DO anymore simply because they took my droplet offline because it got DDoSed.

The attack only lasted a couple minutes and I was able to weather the storm just fine. I only used it as an IRC bouncer. During the attack, my IRC ping went up, but nothing crashed until I got disconnected, couldn't reconnect, and had an e-mail in my box from DO saying they took my droplet down to protect their network.


Yeah we had a similar thing – our database backup machine was taken offline. It only allowed SSH traffic, and DO claimed it received a web (80/443) DDoS. We said the machine didn't have those ports open, and they said "just put Cloudflare in front of it". Completely missing the point, no actual understanding of what was happening.

So, I'd never put a production machine for a business with an active user base there, but... new machines on day 1 of a startup? Sure, it's dead simple. We also had good success with it providing on-demand VMs for developer use, AWS/GCP was somewhat locked down, cost managed, etc, but the team all had a login for DO and could spin up VMs for one-off jobs, that worked well.


Lightsail is heavily throttled, so I wouldn't use it for anything "serious".


Lightsail instances are just burstable (t2/t3) instances renamed.


Plus a few terabytes of outgoing bandwidth tossed in.

As long as you don't use it to "avoid data fees" which seems to be deliberately vague.


The root problem is that AWS charges way too much for data egress, and then sell cheap instances bundled with several times their purchase worth of egress.

AWS doesn't want people putting their apps behind LightSail instances solely for the purpose of reducing data egress costs. I think it's a very reasonable rule considering the situation they put themselves in.

The problem of course is it quickly leads to bad faith instances of people doing just that, then claiming not to be doing it. But it's going to be impossible for them to write a lawyer-proof rule that prevents it while also not preventing legitimate usage.

Like, if I run a web app that accepts requests to example.com/mybucket/myfile.zip, and the app merely pulls myfile.zip from the mybucket S3 bucket and sends it, then yeah, that's a pretty cut-and-dry case of using LightSail to avoid egress fees from S3.

But what if I'm using a LightSail instance as just a load balancer in front of EC2, because I don't think an ELB/ALB is flexible enough? That starts to be a bit more of a grey area.


Do you have any evidence that supports this?


Isn't that the point? "Don't build your own infrastructure, outsource to the cloud!" has been the rally call for a decade. It's the entire reason "the cloud" has any value at all, beyond renting at hardware depreciation rates. Someone else is there maintaining the data center, managing failover, ensuring uptime, etc. by contract.

This press release could easily be interpreted as AWS backpedaling on a commitment to maintain its infrastructure.


Isn’t spending a significant portion of resources doing customer-specific work pretty much the opposite of the point? That sounds just like sysadmin work for hire, which is obviously valuable, but its absence is essentially the key distinguishing factor between traditional server computing and cloud computing.

Of course at the edges there is always going to be overlap, since important customers will have feature requests and complaints that the cloud providers will learn from.


> sysadmin work for hire

you'd be shocked to know how many people believe that this is exactly what the "cloud" is. these are the people making the support requests.


They still have plenty of time to cripple their UI though


What? The AWS UI is much better than that slow ever-slowing-progress-bars RandomError throwing Azure portal.


Just because there are worse things doesn’t mean the current interface doesn’t suck


It does kinda mean that if everything else is worse.


That's funny - I primarily have used Azure but whenever I drop into AWS I feel like it's actively punishing me for trying to use the UI.


Azure UI is awful for stuff like hitting the back button and reliably going back, or even for a structured view of a resource. Try using AKS, and select any given pod or node and the little "path" disappears and you can't reliably go back to the main cluster without hitting the back button that usually does not work. This is also true for most Azure resources, though azureml is oddly very nice to use, UI wise. For some weird reasons, searching in some lists only works with exact prefixes, meaning that you better have the full, correct name of the thing you're looking for.

Azure is fine to display surface level information, but it is atrocious when it comes to actually doing stuff or finding details. AWS, and the new UI in particular imo, is a breeze in comparison. But then again, my favorite cloud UI is GCP so maybe I'm just weird like that :)


I haven't worked with AKS, but in Azure if you want to go back to the previous "parent" resource it's usually pretty simple, you just scroll to the left. It's a bit strange, but once you get used to it you learn to stop using the back button.


AWS UI is awful compared to Azure. At least in Azure I can quickly search across regions and subscriptions for cloud resources.


Hard Disagree! Azure Portal is the most clear Cloud UI out there. From the Icons to the Blade/Journey UX scrolling sideways.


Why would you ever want to use the UI, other than for troubleshooting?! No matter the cloud provider in question, everything should be done through Terraform or another provisioning platform.

Seriously, doing stuff manually on any hosting provider has become a crimson red flag for me. Even at home for my private stuff, I have ansible scripts for every piece of infrastructure. No more manually set up raspberrys or what ever.


We’ll see if it stays this way, but the goal of ProServe and Enterprise Support is to ensure business outcomes rather than directly drive cloud spend.


Ehh, at some point the automatic suggestions for lambda, athena, sagemaker, whatever is at least a little interested in getting customer lockin/stickiness.


I work in ProServe. But it was the same when I was doing regular old enterprise architecture on prem, with Azure, etc. You’re always locked into your infrastructure, and migrating any large infrastructure, no matter how “agnostic” you try to stay, comes with major cost, both real and opportunity, and the risk of bugs and downtime.

Show me your “agnostic” solution and I can show you plenty of ways that you tied yourself to your choices. Not to mention all of the organizational constraints and coordination - architecture review boards, planning committees, project management organizations, etc.


We have some kind of k8s cluster that's just locked into a state where we can't delete it because of some wacky dependencies that no one at the company understands (it was set up by others). So at some point we'll reach out to amazon to have them wipe it.

It's kind of a messy product.


EKS is a special level of IAM pain, and my belief is they do everything possible to make it hard to use so people don't get all excited about that kind of cloud agnostic technology


Now if only they'd spend time optimizing their management APIs and SDKs so they were more coherent.

I do not like using the console or having to login to the web portal, but there are things you just can't do from the SDK, or at least, I haven't figured out the right way to do them yet. Entirely Byzantine.

I wonder if they'll ever get around to addressing the mountain of complaints folks have about local dev and emulators too.


Could you share examples of things that can't be done from the SDK? Thanks!


It's a mixture of being hard to use and things not working as expected.

For instance, if I want to do multi-region deploys for Lambda, I can't just do a simultaneous push to deploy across regions. I'm expected to use this messy switch[0] between the CLI and the CDK to do anything useful. This is a tiny example.

Not to mention the CDK packages have no comments in the shipped files (at least for typescript) so I can't even introspect the typings and try to understand the purpose of things. The documentation is atrocious.

Consider instead Firebase, Supabase or Fly.io, which are dead simple to use.

Developer Experience needs a major overhaul. It should be as easy as, or easier than, Supabase or Firebase. No, Amplify is not a solution: it's too buggy and has zero margin for customization, plus its local dev experience is quite bad.

It'll probably never happen though, because developers are just end users, not the true customer, the buyer (e.g. management) is. I think that's the real reason why these improvements languish

[0]: https://aws.amazon.com/blogs/compute/deploying-aws-lambda-la...


Your original post said SDK, that's AWS's name for this:

https://aws.amazon.com/sdk-for-net/ https://aws.amazon.com/sdk-for-javascript/ etc.

Not this:

https://aws.amazon.com/cdk/

The person you're replying to was confused because you said "APIs and SDK" which clearly refers to the former, but then this followup post is referring to the latter. You can do everything with the SDK. The CDK is implemented on top of the SDK and cannot do everything.
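
For what it's worth, here is a minimal sketch of a parallel multi-region push using the Python SDK (boto3), under some assumptions: the function already exists under the same name in every region, the code is shipped as a zip, and credentials are configured. The helper names `boto3_lambda_client` and `deploy_everywhere` are made up for this sketch; `update_function_code` is the boto3 Lambda client call. It does no rollback or error handling, so treat it as an illustration rather than a deployment tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def boto3_lambda_client(region: str) -> Any:
    """Default client factory; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the sketch can be exercised with a fake client

    return boto3.client("lambda", region_name=region)


def deploy_everywhere(
    function_name: str,
    zip_bytes: bytes,
    regions: Iterable[str],
    make_client: Callable[[str], Any] = boto3_lambda_client,
) -> Dict[str, Any]:
    """Push the same zip to an existing function in every region, in parallel."""
    regions = list(regions)

    def push(region: str) -> Any:
        client = make_client(region)
        return client.update_function_code(FunctionName=function_name, ZipFile=zip_bytes)

    # One worker per region; pool.map returns results in the same order as `regions`.
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        return dict(zip(regions, pool.map(push, regions)))
```

The `make_client` parameter exists only so the function can be tried out against a stub instead of a real account.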


The title has two grammatical errors with quotes. The corrected form would be:

> AWS staff spending ‘much of their time’ optimizing customers' clouds


We had an RDS db running at 100% when typical traffic is, say, 40-50%.

Support never found the solution; we had to scale up and out and completely replace the db two times for performance to return to the baseline.

All told, it cost about 10k for what was essentially an AWS bug.


I believe the next wave of SaaS is products that provide custom deployment, operational support, and cost optimization, where specialist software orgs can provide standard efficiencies and engage AWS for deeper looks.


Use vanilla products (K8S, "RDS", ...) so you can always dangle the threat of going on-prem or to another cloud provider.

Once you use Lambdas and other deep AWS products, you're locked in.


I don't personally feel like Lambdas themselves are the hard lock-in. It's the stuff around them that really gets into your code or forces specific patterns. SQS, SNS, Dynamo, Kinesis, EventBridge, etc.


In my opinion, SQS shouldn't be on that list.

You save a lot of time and resources just using SQS instead of setting up your own servers. And the lock-in is not as problematic as Lambdas etc.

It doesn't require much code change to move to an alternative message queue (RabbitMQ etc.), so the lock-in is minimal. In my projects, I just need to change a few lines of configuration code and the actual message processing code should remain the same.
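
A minimal sketch of what "only the configuration changes" can look like, assuming the application talks to a small interface of its own. Every name here (`MessageQueue`, `InMemoryQueue`, `BACKENDS`, `make_queue`, `process_all`) is hypothetical, and the in-memory backend is a stand-in so the example runs without any broker; an SQS or RabbitMQ adapter would implement the same two methods.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Optional, Protocol


class MessageQueue(Protocol):
    """Hypothetical minimal interface the application codes against."""

    def send(self, body: str) -> None: ...

    def receive(self) -> Optional[str]: ...


class InMemoryQueue:
    """Stand-in backend; a real broker adapter would expose the same methods."""

    def __init__(self) -> None:
        self._messages: deque[str] = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


# Configuration picks the backend; adapters for real brokers would be added here.
BACKENDS: dict[str, Callable[[], MessageQueue]] = {"memory": InMemoryQueue}


def make_queue(backend: str) -> MessageQueue:
    return BACKENDS[backend]()


def process_all(queue: MessageQueue) -> list[str]:
    """Message-processing code: depends only on the interface, not the broker."""
    handled = []
    while (body := queue.receive()) is not None:
        handled.append(body.upper())
    return handled
```

This only holds while the code sticks to plain send/receive semantics; broker-specific behaviour (visibility timeouts, exchanges, ordering guarantees) would leak through such an interface.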


Which frankly are the essential parts of a robust cloud architecture. Deploying a server is trivial - just dump it on any kind of compute resource. Managing fault-tolerance and monitoring, etc. between those services is the tricky part.


IME, not if you're using Terraform with proper modularization and can stick to the commonly available feature set. Have one layer of modules that describe what you want (e.g. a virtual server, database, monitoring) and behind that a separate layer that implements it (e.g. creates an EC2 instance, an RDS database or a CloudWatch alert).
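
To illustrate the two-layer idea outside of Terraform syntax, here is a small Python analogy: one layer describes what is wanted, a second layer maps it onto a provider. The resource names and keys are illustrative placeholders, not real Terraform resource schemas, and all identifiers are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class DatabaseSpec:
    """'What you want' layer: no provider-specific fields."""

    name: str
    storage_gb: int


# 'How it is built' layer: one function per provider (placeholder output shapes).
def aws_database(spec: DatabaseSpec) -> Dict[str, object]:
    return {"resource": "aws-managed-db", "identifier": spec.name, "storage_gb": spec.storage_gb}


def gcp_database(spec: DatabaseSpec) -> Dict[str, object]:
    return {"resource": "gcp-managed-db", "instance": spec.name, "disk_size_gb": spec.storage_gb}


IMPLEMENTATIONS: Dict[str, Callable[[DatabaseSpec], Dict[str, object]]] = {
    "aws": aws_database,
    "gcp": gcp_database,
}


def render(spec: DatabaseSpec, provider: str) -> Dict[str, object]:
    """Callers pass only the spec; switching providers is a one-argument change."""
    return IMPLEMENTATIONS[provider](spec)


print(render(DatabaseSpec("orders", 100), "aws"))
print(render(DatabaseSpec("orders", 100), "gcp"))
```

As the comment says, this only works for the commonly available feature set: anything a provider offers that the spec cannot express has no place in the first layer.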


Terraform is not “cloud agnostic” any more than using Python with the various cloud providers' SDKs is.

Each provider within TF is tied tightly with the cloud provider.

Standard Disclaimer: yes I work at AWS and I do IaC with CloudFormation or TF, depending on what the customer wants


That's a nonstarter just because of how you describe the various services at each cloud provider. For instance, Google cloud is big into using "projects" as organizers for resources, where the closest analogue in AWS land is an entirely separate account.


In my experience writing cloud agnostic infra usually ends up costlier. Not just in monetary terms running the infrastructure, but also in terms of time spent maintaining the infrastructure.

There’s generally very little point using the cloud if you’re not going to use cloud services.


If all you need is machines then there is still a massive benefit in using the cloud. You can spin up 1000 machines tomorrow without having to buy them, wait for getting them (2 months later), rack them, cable them, ....

Especially if you need a couple of racks things get complicated quickly. If you have never racked machines for a week you probably underestimate the amount of time it takes before you can actually run workloads on them.

And if you need a 1000 machines for x years which otherwise you'd put in your own data center you can absolutely go to AWS, GCP, OCI and bid them down against each other.


> If all you need is machines then there is still a massive benefit in using the cloud.

“massive” is an overstatement. “some” would be more accurate. And your example is correct, which is why I said “generally” in my post rather than “always” or “never”. In fact my current project is using AWS in exactly the way you described.

> If you have never racked machines

I have done. Many times. :)

And there are ways you can get machines at short notice without going down the AWS route. In fact it’s actually pretty easy and used to be really common. I used to use all sorts of different VPS and dedicated hosting for personal projects (while managing on prem stuff professionally). These days a lot of those bigger companies are now just wrapped up as part of “cloud services” but there are still some smaller providers out there doing things old school. In fact I have a few GFX cards for machine learning in a London data centre managed by one such independent provider.

> And if you need a 1000 machines for x years which otherwise you'd put in your own data center you can absolutely go to AWS, GCP, OCI and bid them down against each other.

You’re now talking about cost optimisation and commitment-based discounts, which is a whole other tangent.

Anyways, my point wasn’t that the cloud is more or less expensive than on prem. It was that designing your cloud architecture to be cloud agnostic is more expensive and rarely worthwhile compared with making use of the services offered by whichever cloud provider you’re using.


You could use containerized lambdas. Your application logic would be mostly safe to migrate around as need be.

But then again, migrating from one cloud to another is always a heavy lift no matter how it’s constructed.


Having done zero downtime migration from AWS to Google to managed servers (chasing credits, and then moving to Hetzner when the credits ran out): it's not much heavy lifting if you treat each cloud provider as just providing raw compute and storage resources.


It's because you did the heavy lifting up front with how your application is designed and how your infra is provisioned (I'm guessing everything is terraformed or something?).


Also a great unlock if you do the bulk of your work in kubernetes. Amazon is certainly doing some things in their kube implementation that will make moving suck more (karpenter, some of their other integrations) in an effort to be better and thus sticky... But it's still pretty easy to go from one kube cluster to another.


Everything in Docker containers + a service discovery layer (Consul on that particular case) will get you most of the way there as long as you don't get seduced into relying on cloud specific services.


And then you’re managing Consul clusters (been there done that).


I never found that a problem. Consul was robust when I used it.


With the lambda web adapter[1], you shouldn't need to make any code changes for web projects, just some dockerfile changes which are only used if it is running as a lambda, so the same image should still work on ecs or another cloud.

[1] https://github.com/awslabs/aws-lambda-web-adapter


If possible, can you suggest a list of vanilla products? My team is building a platform that allows anyone to spin up instances on any cloud platform. We’re planning to add support for DNS management (Route53, Google Cloud DNS & Azure DNS) and storage buckets.


Just what we need, another platform.


I don't see much difference in vendor lock-in between EKS and Lambda. The abstraction between your Kubernetes application and the underlying platform isn't airtight.


Of course. AWS apps are a mess and needlessly over complicated. Even if sales went on they'd still spend more time untangling the mess they've made for their clients.


IBM's Hybrid Cloud concept is promising.

Run your sensitive or expensive bulk resources like dbs onprem and go cloud for the rest. You can often save big just bringing your db onsite.


In the AWS world, you can do this by colocating servers at an AWS Direct Connect facility. You get a fiber connection from your rack to AWS's rack inside the building, and then AWS backhauls it on their network to the AWS Region. There are such colocation facilities located within a few miles of any major AWS Region for low latency (~1 msec) connections.


The concept is promising, agreed. But like most (all?) recent IBM products, the failure point will be in the execution of the concept. IBM is a marketing company at this point, not a serious technology company.


Execution and rent-seeking – their culture is based on high margins so they tend to design things which will put a lot of hard to migrate logic into their app and even when they know they’re competing against more affordable options the prices will be eye-watering. I assume that the whales who do buy generate enough revenue to make up for the lost sales but it doesn’t seem like a good long-term strategy.


Works if latency isn't an issue between app and db, which it often times is.


I did a cloud migration from a local datacenter to GCP and DB latency was the big issue for us. Latency from our DC to GCP was ~10ms, so an average page load with 10 queries added 100ms overhead which was mostly unacceptable.

Big enterprisey applications like this would love to only be doing 10 queries, 30+ is much more likely, and I bet they won't be moving from 1 DC to 1 cloud region, so 10ms is also unlikely. Back of the envelope suggests more like 1s of additional latency.

That would barely work for a big enough OLAP deployment, with all sorts of varied consumers, let alone OLTP.
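
The back-of-the-envelope numbers above can be reproduced with a few lines, assuming the queries on a page run one after another over the same link (batching or running them in parallel would lower the total). The 33 ms round trip in the second case is an assumed figure picked to land near the "more like 1s" estimate; it is not from the comment.

```python
def added_latency_ms(queries_per_page: int, round_trip_ms: float) -> float:
    """Extra page latency when each query pays one app-to-db round trip in sequence."""
    return queries_per_page * round_trip_ms


# Case described above: 10 queries at ~10 ms each.
print(added_latency_ms(10, 10), "ms")

# Assumed enterprise case: 30 queries, hypothetical 33 ms round trip.
print(added_latency_ms(30, 33), "ms")
```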


If AWS wants to stop doing this, AWS can prioritize making optimizations easier for the customer to do...


are there enterprise grade products that don't end up spending a large amount of time providing enterprise services?


this is not new. AWS support would always help with this.


Lol optimize to help the customer or Amazon?


TL;DR: they want to rip you off less aggressively, so you don't ditch them on the short term.

Cloud computing has become so expensive that so many companies are moving away from it. And for good reason.

The only providers that are actually reasonable on pricing at the moment are OVH and Hetzner... by an order of magnitude (or two) cheaper than Google Cloud, AWS, Azure, Vercel, etc.

Trusting any of the latter companies not to rip you off is like trusting a wolf to take care of the sheep herd. It just doesn't make any sense, as their incentives are often directly opposed to their customers'.


I expect almost everyone who tries to rebel and go "on prem" will flame out and come running back to the cloud

Ops talent has really dumbed down over the last decade or so (sorry no polite way to say it). Long gone are the days when the ops engineers were talented devs who just liked hanging out at datacenters. Nowadays most of them are just button pushers who are good for standing up SaaS stacks or other simple tasks, but I'd never ask any I've dealt with in the last ten years to build out a datacenter presence from nothing. Most have never set foot in a datacenter. On top of that, most modern ops teams have been downsized so much (thanks to cloud) that they wouldn't have the hands to get the job done.

Even for good ops people, it can be very hard to capacity-plan and understand what appropriate/affordable/useful hardware is. Most ops folks today have probably never purchased servers for production deployments.

AWS will be happy to welcome these companies back after their on-prem dreams die.


>Ops talent has really dumbed down over the last decade or so

Counterpoint, the ops required for your own infrastructure has greatly dumbed down.

It might surprise you, but OEM hardware solutions have greatly evolved and simplified in the same way.

Now I can set up a core switch running 100 Gb/s with failovers by just plugging in some cables between them, setting a flag and they immediately start replicating config between themselves.

>Even for good ops people, it can be very hard to capacity-plan and understand what appropriate/affordable/useful hardware is. Most ops folks today have probably never purchased servers for production deployments.

I'll shock you once again: the same way AWS has sales and support departments that help spoon-feed you what you need, so do the hardware OEMs. I can get complete solution walkthroughs with Dell, for example.


> Now I can set up a core switch running 100 Gb/s with failovers by just plugging in some cables

no ops person I have worked with in the last ten years knows what a switch is, and this is over many companies both startup and bigtech

no different for storage, compute...the experience and aptitude gap at most companies, even tech companies, is profound

ops has become devops, netops is no longer in their skillset


You missed a few "damn Millennials" in those comments.

Anecdotally, your comments do not line up with my experience at all. Not only is our Sr SysAdmin providing sage wisdom and networking guidance to our DevOps team, but several of our Jr guys are taking the time to learn networking specifically (certs and all). In fact, we'll have a few new Juniper specialists in 6 months or so.


I find this difficult to accept... none of your coworkers ever built their own PC? Set up their own home network?


It's believable for the kinds of organizations that the poster clearly works for.


You are obviously having enormous difficulty attracting good ops and infra programmers, but that's mostly going to be due to the obvious lack of respect and probably abysmal pay you are offering.

I would run away from an interview at a company you work for as fast as I possibly could just based on what you've said so far. And if I worked for you I'd quit.

Case in point, you seem completely incapable of understanding how your own lack of expertise and awful pay have driven all the experienced infra people to work for cloud vendors where they can get paid what they deserve and not have to grovel before a bunch of shitty javascript devs or equivalent.

Orgs like yours are trying to hire infra people the way you would hire a web developer. That isn't viable now and has never been viable. Infra roles are (or at least should be for any functional organization) more senior than product roles.


Counterpoint: I think that just means that on prem needs to close the tooling gap some. Cloud was a centralized point of leverage where improved tooling became worth investing in, but now we're at a point where cloud is vulnerable to somebody following in their footsteps by making on prem not require as much tending.

You still won't be able to do it with idiot monkeys. You just have to enable a small team of strong people. It may be getting harder to find those people, but it's not so bad if you only need a handful of them.

https://oxide.computer/ is shooting for this space, from what I can tell?


Ops talent didn't disappear, it just got centralized in Amazon/Google/Microsoft. Now that those companies are laying off thousands, it is likely that competent ops people with data-center experience will end up on the job market.


Standard disclaimer: Andy Jassy is my skip * 7 manager.

He has said plenty of times in public statements that less than 5% of all IT spend is on any cloud provider.

https://accelerationeconomy.com/cloud/amazon-shocker-ceo-jas...


Eh, it depends. I've seen such inefficient use of cloud providers. Really depends on what cloud users need to do.


> Cloud computing has become so expensive that so many companies are moving away from it

Net? Sales figures don't show this.



