How We Saved $132k a Year With an IT Infrastructure Audit (buffer.com)
164 points by joshsharp on April 6, 2016 | 110 comments



I will add another suggestion, if you use S3 at all... one of your largest costs is likely the bandwidth. Have you considered just placing caches in front?

I took a $100 per month S3 bill down to $5 per month by simply having existing Nginx servers enable a file cache.

It does help that I never need to purge the cache (new versions are saved in S3 under new URLs), but it was super trivial to just wipe out $95 per month of cost for zero extra spend.

My current setup is:

S3 contains user photos; the web app (not on AWS) handles POST/GET for S3 (and stores local knowledge); Nginx at my edge has a cache that is currently around 28GB of files; and then CloudFlare in front of all of this saves me around 2TB of bandwidth per month.

The real gotcha for me is that I was relying on CDNs for my cache, but once the CDNs reached 50+ PoPs I started to see multiple requests for the same thing as a result of people in different cities requesting a file. So the Nginx cache I've added mostly deals with this scenario and prevents additional S3 cost from being incurred.
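For anyone who wants to see the shape of this, here's a minimal sketch of the same idea in Python rather than the actual nginx proxy-cache config (the bucket name and cache path are made up, and it assumes boto3 with AWS credentials configured): check a local disk cache first and only fall through to S3 on a miss. Since new versions get new URLs, entries never need purging.

    # Sketch only, not my real setup: a tiny read-through disk cache in front of S3.
    import os
    import boto3

    CACHE_DIR = "/var/cache/s3-local"   # hypothetical cache location
    s3 = boto3.client("s3")

    def get_object_cached(bucket, key):
        """Return the object body, hitting S3 only on a cache miss."""
        cache_path = os.path.join(CACHE_DIR, bucket, key.replace("/", "_"))
        if os.path.exists(cache_path):            # hit: no S3 request, no S3 bandwidth
            with open(cache_path, "rb") as f:
                return f.read()
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()   # miss: one S3 GET
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)
        with open(cache_path, "wb") as f:         # safe to keep forever: new versions get new keys
            f.write(body)
        return body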


> The real gotcha for me is that I was relying on CDNs for my cache, but once the CDNs reached 50+ PoPs I started to see multiple requests for the same thing as a result of people in different cities requesting a file.

Having never used a CDN, this sounds weird. This means that the caches are not synchronized between PoPs, even though they're supposed to be at the same level in an HTTP request. Is this normal behaviour for a CDN? I'd expect one PoP to check with other PoPs before hitting upstream.


It's normal for most smaller CDNs to not have their PoPs communicate, yes.

With larger CDNs you start to get hierarchical caches: https://trafficserver.readthedocs.org/en/5.3.x/admin/hierach...

The theory being that the PoP closest to the origin is the one responsible for going to the origin, and thus that other PoPs will fetch cached items from that PoP rather than from the origin itself.

Nearly all of the very large CDNs support some degree of hierarchical caching, and the ones becoming large are gaining the capability.

At CloudFlare (where I work) the need for a hierarchical cache was low priority until we started to rapidly increase our global capacity... once you reach a certain scale then the need for some way to have PoPs not visit the origin more than once for an item becomes very important. You can be sure we're working on that (if you are an Enterprise customer you could contact sales to enquire about beta testing for it).

But right now, for us and many other providers, just enabling an nginx cache in front of any expensive resource will help. By "expensive", I generally mean anything that will trigger extra expenditure or processing when it could be cached.

Edit: Additionally, nearly every CDN is operating an LRU cache. You haven't bought storage and not everything you've ever stored is being held in cache. You only need a few bots (GoogleBot, Yandex, Baidu, etc) constantly spidering to be pulling the long tail of files from your backend S3 if you haven't got your own cache in front of it. Hierarchical caching isn't a silver bullet that takes care of all potential costs incurred, but having your own cache is.


Thanks for the detailed answer. The only technical understanding of a CDN I have is from CoralCDN (http://www.coralcdn.org/), a p2p CDN operated by researchers on PlanetLab (with servers all around the world) that was especially created to mitigate flash crowd scenarios, and their documentation on this experiment (http://www.coralcdn.org/pubs/, particularly http://www.coralcdn.org/docs/coral-nsdi04.pdf). I was impressed by how the nodes automagically coordinate themselves so that as few as possible query upstream, and content is then slowly propagated to the other nodes requesting that exact content in the form of a multicast tree, all with zero administration, thanks to a smart DHT and a smart DNS implementation. Really cool stuff. I was under the impression that any commercial CDN was doing at least the same.


What size is the Cloudflare cache on the Free and Pro plans?


It's a question without an exact answer.

It depends where your origin is, where your users are (near 1 PoP? 2 PoPs? spread evenly globally?), how frequently files that can be cached are requested, the capacity of each PoP, the contention of each PoP, etc.

A few years ago when the customer and traffic growth rate exceeded the network expansion rate the answer was probably "not big enough", but we've since upgraded almost every PoP and added a huge number of new ones: https://www.cloudflare.com/network-map/

The answer now is "more than big enough".

We cache as much as possible, for as long as possible. The more requested a file, the more likely it is to be in the cache even if you're on the Free plan. Lots of logic is applied to this, more than could fit in this reply.

But importantly: there's no difference in how much you can cache between the plans. Wherever it is possible, we make the Free plan have as much capability as the other plans.


It depends on the CDN, but for example Fastly calls what the op was suggesting "origin shield": https://docs.fastly.com/guides/performance-tuning/shielding


Most CDNs by default make a request per POP. You can usually enable "origin shield", which will cause the CDN to make a single request and distribute the file to POPs internally.


I use KeyCDN and they offer an "Origin Shield" so there is only one request to the file and the PoPs request the file from the shield server. Other CDNs probably offer that as well.


I use https://www.netlify.com over S3. They're a static website host, but they let you remap URLs to proxy whatever backend you like. A big chunk of bandwidth is included and it mitigates AWS per-request billing too.


This is a great one. We are going to be looking into this soon, hopefully.


I think your use of CloudFlare might be a violation of their terms of use:

https://www.cloudflare.com/terms/

Specifically, see section 10, "LIMITATION ON NON-HTML CACHING".


I work for CloudFlare.

The key part of that clause is this:

    the purpose of CloudFlare’s Service is to proxy web content, not store data.
    Using an account primarily as an online storage space, including the storage or
    caching of a disproportionate percentage of pictures, movies, audio files, or
    other non-HTML content, is prohibited
The site in question is also on CloudFlare, and so this clause does not apply. The cached assets are neither disproportionate nor is this being used as a storage space (S3 is still that).


Ah, OK. I thought the key concern was bandwidth, not storage.


It is.

From Amazon S3, who charge you for it.

Then to Linode who don't charge if you're below an allowance.

Then onto CloudFlare who do not charge you for it at all.

I was a CloudFlare customer long before being a CloudFlare employee.


How can CloudFlare afford to not charge for bandwidth at all? I figured the point of section 10 was to prevent abuse of that free bandwidth.


The clause is there to provide a tool for dealing with some of the interesting things that people will try to use a CDN with no bandwidth charges for.

i.e. what if whole movies were given a different content type and file extension, would CloudFlare suddenly be the world's largest CDN of pirated movies?

Most such questions have been tried already, the clause allows us to answer "only for an exceptionally short amount of time".


I know! Because they pay for bandwidth at the 95th percentile, having high output makes their DDoS input free.


This is also true :)

And we can negotiate better prices with more traffic, etc.


A few people in the comments are saying 'isn't it better to just set up your own SQL server instead of RDS?' and similar. I don't want to post a reply to each, so I will say it here.

While I can totally sympathize from a programmer's point of view (setting things up, tweaking stuff and all is great fun), you need to ask yourself whether it is in the interests of the business to do so. Especially if you're working in a small team with no dedicated infrastructure staff, or at a startup with a short runway and a lot of urgent user-facing changes.

Doing something on your own (e.g. setting up your own alternative to S3, or configuring your own SQL servers) comes with a cost, and it's not only the programming/initial setup time. It's also opportunity cost (instead of setting up a server, I could, for example, analyze some user data); maintenance (more things to worry about which you could outsource); the skill set required to run the infrastructure (running your own SQL cluster requires more knowledge and more training than running one on RDS), etc.

So is it in the interest of the business to run your own infrastructure?

If you have thousands of servers and are spending millions on them - probably, but then you can probably make an attractive deal with GCE or AWS :)

If your application needs some complex performance related stuff which is harder to do in the cloud (e.g. some custom hardware or whatever), then again, running your own infrastructure might be better.

But if you are like the majority of the companies/products (you just need infrastructure to run reliably and performance should be just good enough), using AWS and friends might make a big difference.


If you come to the conclusion that the opportunity cost is too high for your team/company, fine... I can believe it... as long as you're also weighing the benefit of learning. All learning has an opportunity cost.

I do believe that someone who can set up and configure nginx to do load balancing, caching and rate limiting and build middlewares in Lua is potentially going to be a more productive full stack developer than someone who can't. As a programmer, having managed PostgreSQL has made me a more effective programmer. I have a good understanding of how it vacuums and collects statistics, and of the relationship between connections, sorts and work_mem, so I'm better equipped to write queries and troubleshoot issues.

The gain to me personally and to my employer (and future employers) is not trivial.


This reminds me of recent discussions about how countries that outsource their manufacturing quickly lose knowledge of manufacturing technology and fall behind in innovation and self-reliance. I don't think we are there yet, but it is conceivable that in the future system administration could become a lost art to many. Something to consider as more of our infrastructure needs are met by the cloud. I'm sure I'm overstating this, but I figured I'd share anyway.


It's true that you gain some technical knowledge by learning all that. The question is whether that is worth the opportunity cost of not learning other things or doing something else during that time.

Every minute you spend studying nginx config files or adjusting work_mem is a minute you are not spending working on your actual product or app.

To some extent it is indeed beneficial to know how all that stuff works. But at some point your time is better spent working on things that differentiate your specific company from others, rather than spending it on twiddling the same set of knobs that everyone else has.


Great point and I totally agree with you. Treating everything like a black box that you just shove money into isn't going to take you far either :)


The biggest advantage that makes RDS and friends (DynamoDB, Redshift) so extremely attractive is that it (almost) completely offloads operating system and database server responsibilities onto Amazon. While there are some workloads that legitimately require physical hardware to execute on, there are a LOT of costs associated with that decision (networking, storage, power, hosting, licensing, etc.) that people tend to forget in cost comparisons.

I've worked with hardware for several years and love the challenges that come with it, but in most cases, the cost of debugging those problems and working with vendors to get fixes/replacements done isn't worth it. It's a HUGE time suck. I've been working with AWS for the last few months, and being able to safely dispose of a bad instance and spin up whatever underlying services ran on it somewhere else is amazing.

If I were in charge of building infrastructure at an early-stage startup (or even one that's at a later-stage), I would absolutely 100% start with AWS and really squeeze every bit of performance out of the code for their products before thinking about going physical.


You still have to tune the configuration of RDS instances, and they can make some administrative tasks more difficult because they don't grant SUPER and/or don't allow system-level access. For example, you can't use innobackupex/xtrabackup on an RDS instance.

I appreciate that it is often worthwhile to just pay someone else to handle things for your company, but I think people are too quick to flip the switch one way or the other; they either run everything "in the cloud" or nothing there. The truth is that as with most things, there is a happy medium tailored to each company's specific circumstances.

AWS is very expensive. People underestimate how expensive it is. When we switched to AWS, our monthly bill was about 70% the total cost to buy (not rent, buy) all the machines in our old bare-metal datacenter, which were still perfectly performant (we switched because the execs wanted to be super-duper cool cloud users, not because of any real technical limitation, though there are pros and cons to it).

I have to think this is fairly common. I hear a popular trick to net a big savings in your first year as CFO is to just force the tech guys to cut out Amazon and replace it with a cheaper cloud provider or colocated bare-metal.


AWS is a platform. If all you want is a VPS, it's overpriced and underperformant (or so I hear).

If you just took a load off of bare metal and moved it to AWS, I don't doubt it got significantly more expensive... but that's a terrible mis-use of the platform.

When you develop an app around the products they offer, it starts to look a lot more reasonable. Especially at a small scale.

When I need a durable messaging service in my app, I don't need to go read up on rabbitmq, find a couple servers somewhere, set up some sort of redundant message queue so I get some level of durability, set up monitoring, deal with updates, debugging problems, etc. I just make some API calls to SQS and it's already there for me.
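For a sense of how little code that is, here's a rough boto3 sketch (the queue name and message are made up; assumes AWS credentials are configured) of the producer and consumer sides:

    # Hedged sketch of the SQS calls being described; names are placeholders.
    import boto3

    sqs = boto3.resource("sqs")
    queue = sqs.create_queue(QueueName="orders")   # durable queue, no servers to run

    # Producer: enqueue a message.
    queue.send_message(MessageBody='{"order_id": 42}')

    # Consumer: long-poll for work, delete what was processed so it isn't redelivered.
    for msg in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        print("processing", msg.body)
        msg.delete()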

If "cut out Amazon and replace it with colocated bare-metal" is really that simple, then chances are you're not making very good use of the platform and you should absolutely switch off of it.


Right, I said that there wasn't really a technical reason for the move and that it was primarily motivated by political concerns. I'm not defending it.

I think you're right that a piecemeal approach is best. If you're using SQS because you identified it as a component that could be quickly and easily integrated at a lower cost than it would take to run apt-get install rabbitmq-server, then great. That may sound like a trivialization but for many companies the reality is that their "Linux guy" is a noob who can barely wield apt-get, and in these cases, something like SQS is indeed a good offering. In our case, it's definitely easier/cheaper/better to install RabbitMQ.

The ideal infrastructure combination is going to vary between companies based on the technical resources they have available and the technical requirements of their applications, but what I'm saying is that in general, there shouldn't be a default position of "all cloud" (super expensive, and also potentially time consuming) or "all metal" (super time consuming, and thus expensive).

I have to admit that there is something persistently annoying about your comment. I think it's the implication that if we switched everything to pure Amazon, it would somehow suddenly become a financially beneficial option. I think that is absurd and the only explanation for such a position is fanboyism.

Like I said, I'm sure that for some companies, money can be saved by using an appropriate combination of AWS services, but it shouldn't be taken as an implicit truth that it will apply to your circumstances, and that if it doesn't, you just need to further intertwine into Amazon's platform until it appears to become impractical to move off of it. I'm sure at that point you will be "saving money" by staying on Amazon, but only because you've created such a massive dependency on an external third party for your company's operation that it'd take months to walk it back.

One last note. I do use an AWS service for my side project: Route53. It costs me less than $2/mo and provides a quick, easy, and powerful way to manipulate DNS records for my domains. For me, this makes a lot more sense than using the registrar's free DNS servers that take up to an hour to update or trying to run BIND myself. I'm completely open to using other AWS services when it makes sense.
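As an illustration of what that looks like in practice, here's a hedged boto3 sketch (the zone ID, hostname and IP are placeholders, not my real records):

    # Upsert an A record in Route53; takes effect in seconds rather than hours.
    import boto3

    r53 = boto3.client("route53")
    r53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE_ZONE_ID",          # hypothetical hosted zone
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }]
        },
    )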


The financial picture of AWS only makes sense if you are using it to reduce (either in absolute terms or in growth rate) your spend on IT staffing costs.

If you're paying 100% of your old IT staffing costs and paying the AWS bill, you're double paying part of it, IMO.


Sure, that's an argument that I'm sure crosses the threshold sometimes. However, with the amount of money we're paying AWS, we could've easily brought on several new full-time, dedicated guys to handle datacenter, hardware, colo, and sysadmin stuff and still come out ahead.

AWS is very expensive. People underestimate how expensive it is. I don't preclude the possibility that it's cost-effective for someone out there, but I seriously doubt that's the case for most of its users.


When our CEO told me we were moving to "the cloud," no amount of financial or technical explanation would convince him.

It was entirely about chasing VC dollars and converting CAPEX into OPEX. VCs hate CAPEX, and previous fundraising rounds at our previous venture highlighted our large CAPEX due to millions of dollars of physical hardware as a major turnoff.

So, I retired perfectly good 3-year-old servers that were fully amortized and replaced them with 4x as many instances, and brought up our monthly infrastructure spend by 10X.

10X! But at least now the VCs are happy.


AWS is much like renting a car. If you only need a car for a week out of the year, you're much better off renting one rather than buying one and leaving it idle for 51 weeks. But if you know you'll need a car every day, it's much cheaper to buy or long-term lease one rather than rent at the daily rate.

AWS is the daily-rate car rental. It's great when you have 1) highly variable load, and 2) your entire infrastructure is instrumented to scale up and down quickly with minimal manual intervention.

If, for example, you are running a large e-commerce site -- the case it was designed for -- AWS is great. Your load will be average most of the year, then go up a lot for Christmas shopping in November, then go up even more in December... then load goes back down to average for the remaining 10 months.

Part of the issue is also cultural. Younger developers who grew up hearing "cloud is awesome" every day think the very idea of colocating hardware is archaic and ridiculous, and won't even consider it.


Yeah, I agree with you. A good rollout would use bare-metal base hardware and scalable on-demand instances that can be auto-spawned and auto-decommissioned in the cloud.

However, I agree that the node.js generation is coming up believing that cloud is implicitly superior. Really I think this is a reaction to the fact that they don't know anything about system administration, either software or hardware, and they're trying to cover that up by claiming that anyone who uses real hardware is behind the times (in fact, they've even taken it to the extreme that anyone who uses a real server-side language is behind the times). It'd be great if we had more understanding of the marketing and political efforts in the tech world and had an entity that tried to counteract some of these fads before they got out of control. node.js itself would be a good target for such an entity.


Bringing on more people has additional hidden costs as well. HR, payroll, benefits/401K, management, staffing an on-call rotation for hardware issues (you still have to staff software on-call in either case), dealing with vacations, sickness, training, etc.

It's surprisingly expensive to run a full complement of web-facing IT services. Amazon (and Azure/GCE/RS/others) just make that calculus more explicit and in your face. (I also acknowledge that they are running AWS as a profitable and profit-seeking enterprise, so yes, they are marking things up beyond the lowest possible cost.)

For us, AWS is absolutely more expensive on a headline number basis. It's also decreased our latency of prod delivery and increased our agility as compared to on-prem solutions. We can get products in front of customers in hours/days that used to take weeks/months.


>Bringing on more people has additional hidden costs as well.

I'm including those costs in my estimation of how many people we could hire and still save money over using AWS. It's still several.


Bare metal is cheaper than cloud (not just AWS), sure. But you also have to know how to do bare metal right - you need the knowledge base, and that is a critical point. You don't want to be training up new developers in how to properly install and monitor raid cards, for example.

But if you only see AWS as "virtualised servers", you're probably not using it to its full potential.


This can definitely happen in some cases, but in plenty of other cases, moving off the cloud can give you an enormous cost savings _and_ dramatically improve your performance at the same time.

A coworker of mine was employee ~10 at a startup with some basic photogrammetry workloads. They ran jobs over ~0.1 to 1GB of images (lasting anywhere from minutes to hours) and operated primarily on Digital Ocean. They also had VPSes at AWS and Azure. My coworker helped them move this to LXC running on top of cheap last-gen dedicated servers. He built the entire (pretty simple) job management stack in about two weeks.

As a result, the company was able to scale to ~100x the previous number of jobs at roughly the same cost. The improvement was so dramatic that they were able to offer a free tier for their services - a business-model change enabled by cheap physical hardware.

In my own experience, I recently moved a production Postgres database off of Compose, who was charging us something like $400 a month for 2GB of RAM and 20GB of storage. I moved it to a hot-swap pair running in Rackspace on boxes that cost $2000 /month for the pair. That's 5x the cost. However, they each have 64GB of RAM and an 800GB SSD. The migration took me about two days total, including learning how to set up the hot-swap, and we have had very few operational issues with the database (knock on wood).

This decreased our mean page load times by an order of magnitude.

In some cases, dedicated hardware can really be worth the time cost.


I agree with that for many AWS services, especially S3. However I find it hard to believe about RDS. I have spent thousands of dollars on AWS bills and never ever used RDS. I've always set up my own DB servers.


If saving money and giving your users a better experience are priorities then, in general, moving off AWS is worth considering. When it comes to EC2: use spot instances, or maybe you're doing it wrong.

Conservatively speaking, without bandwidth, you're looking at EC2 costing 2-4x more than dedicated while being 2-4x slower. This does depend on the specific workload, and the gap has been closing (conversely, I've seen specific workloads be worse than 4x slower).

I know RDS is convenient. But learning how to setup and manage your own database is actually a fundamental skill that will serve you well. All learning can be seen as an opportunity cost, but this is one that will let you save money every month, and give your users a better experience.


Do you have any place you can point to for that?

Anecdotally, I have found that when we account for...

- Human resources costs (payroll, taxes, benefits) [1]

- Time to market for new features / products

- Utilizing reserved instances where it makes sense

- Appropriately sizing machines

We get with AWS...

- Faster time to market (accelerated revenue)

- Relatively the same cost per month (cheaper in some areas more expensive in others)

- Significantly lower initial investment

- Increased redundancy (via many small servers vs. a few large servers) / decreased disaster recovery times (and in many cases automated recovery)

I'm not saying you're wrong, that's just what I've seen when we run the numbers internally. You may have seen differently which is why I ask.

[1] A devops person to manage servers, dedicated or not, can cost $80-$125k+ a year after benefits and taxes. That is a lot of AWS instances. And we have found we need an IT staff about half the size to manage AWS vs a dedicated data center.


For my last start-up, I moved off EC2 to a dedicated vSphere cluster with a hosted provider. vSphere has an API, so adapting existing provisioning & deployment code was quite straightforward (I used the fog gem in Ruby). I basically treated the vSphere cluster very similarly to EC2 (small root stores, attached volumes, etc). Granted, I did have to give up the benefit of being in different regions.

I found the maintenance burden dropped substantially. It may have just been that I was running on newer hardware, but vSphere has built-in HA features such that VMs will just migrate between hosts when hardware degrades. In the two years I ran that setup, I never lost a VM.

I also had dedicated, modern hardware -- no need to worry about CPU steal. I could create instances/VMs of any size I want. If I needed to resize a VM, I could do so without losing all the data on it. As long as I had capacity, I could add new VMs at no extra cost. When I needed more capacity, I'd just call up and have them add a new blade to the cluster, which both increased my redundancy and gave me extra capacity. Essentially, the cost curve, while tied to usage, was much more favorable than linear growth once you get over the base setup costs.

On top of that, I had a real console to each of the VMs should anything go wrong. And if I couldn't fix something myself, there was dedicated staff at the colo facility I could just call up.

It's not for everyone, but there are options out there that give you an EC2-like environment with many of the benefits of having your own hardware.


The tradeoff for CPU stealing is that now you have multiple services running on the same machine so if the machine goes down you lose them all. On AWS it is very unlikely that two of your VMs end up on the same physical machine.

Also, I actually can't recall (in 8 years of AWS) a time where I saw CPU being stolen. Perhaps I've been lucky. I also don't watch for it constantly.

We actually did this with our dev servers. Although switch VSphere for VirtualBox on Linux and colo for a server room in our office. It worked great and saved a TON of cost. But we don't need pesky things like 24hr uptime and dedicated bandwidth for dev servers.


Well, this is what I was referring to with the HA features of vSphere. It works basically like a compute level RAID. The VMs were stored on a SAN and vSphere monitored each of the compute blades. If one went down, the VMs were seamlessly migrated to a hot spare blade. The vSphere docs claim this can be achieved with zero packet loss -- a claim I was never able to test or verify. If you're worried about losing multiple machines, just add multiple hot spares.

Of course, this doesn't help if you lose an entire rack or the data center. I concede this was a trade-off. But given how many times an entire region went down when I was on EC2, I was satisfied with the risk based on the colo environment's uptime record. The provider did offer another facility, but the latency between the two was too high to be of practical use in a failover without keeping a completely mirrored configuration in both locations.

It sounds like you have some experience with vSphere, so I don't intend to be patronizing. But there's a huge difference between "enterprise" virtualization and what you get with an ad hoc setup using desktop virtualization tools.


> The tradeoff for CPU stealing is that now you have multiple services running on the same machine so if the machine goes down you lose them all. On AWS it is very unlikely that two of your VMs end up on the same physical machine.

Yes. But its also very likely that when AWS has issues, the entire region is going to be having problems, like last Saturday (IAM, EC2, Autoscaling, etc broken badly for 6 hours).

> But we don't need pesky things like 24hr uptime and dedicated bandwidth for dev servers.

You're not getting this SLA unless you're managing everything yourself and are globally redundant. AWS doesn't provide bandwidth guarantees, nor will you get 100% uptime.


True. You have to design for failure with AWS. Which does have a large amount of cognitive overhead. If you don't design for failure you're going to have a bad time.

However, designing for strict SLAs is not impossible with AWS. You just need to have multi-region redundancy and you can get very very good with multi-availability zone redundancy.

I have no excuse for the IAM outage, it sucked for our ops team. I guess my only two consolations are:

1. We haven't had a customer-visible outage due to AWS in years because we follow best practices (which does cost more - but see my previous point on more smaller machines vs many large machines)

2. If we were running our own authentication and access control system similar to IAM, it too could have an outage.

But I agree, that is a bad thing about AWS.


Being on AWS does not necessarily mean that you don't need a devops person - especially at the scale where not being on AWS actually makes a difference to your margins.

I have seen quite a few people move off AWS successfully onto bare-metal leased hardware. S3 is just about the only service that's difficult to find an alternative for. Personally, I find using something like DynamoDB no different than using an Oracle DB - it's a vendor lock-in. Unless you had enterprise-level support on AWS (which costs a lot) - if you run into issues with Amazon's proprietary services, then good luck to you.

AWS is great to get started, but once you know that you're going to need scale (and lots of infra), it's best to move.

I say all of this as someone who has extensive operating experience on AWS. YMMV.


Oh, with cloud services you definitely need devops.

Without AWS you also need dedicated infrastructure, networking and hardware people as well as devops.

You need people who know how to configure Cisco networking gear, who understand SANs, iSCSI, Fibre Channel, racks, blades - lots of stuff even devops people don't think about.


What exactly is "DevOps" to you? I ask, because almost everyone has a different answer. I've been doing "DevOps" for more than ten years, and all of the items you listed as required skill-sets are within my capabilities and have been used at many of the places I've worked. I'd be hard pressed to call someone an ops person if they don't understand the basics of server hardware, networking, and storage. These are essential components which the system relies on, the same system you are responsible for the uptime of.

I often hear things like your statements and can't help but wonder if the general quality of ops people is so bad in our industry and I just haven't encountered it, or if the reason ops people are treated so poorly in most organizations is just that developers automatically assume we don't know anything rather than asking.


DevOps for me would be things like Puppet, networking (subnets, load balancing, firewalls, etc.), deployments, and CloudFormation or ARM templates, rather than directly setting up hardware.

Whereas a dedicated networking person would know the specifics of a certain vendor. You can make a career out of just knowing how to set up Cisco hardware and Cisco's embedded OS. DevOps people tend to be broader than that.


All of the items you listed fall under my definition of "DevOps" as well. I loosely define it in two ways:

1) "DevOps is a philosophy, not a title" (this is mostly because of managers thinking otherwise)

2) "DevOps is about focusing on automation of systems infrastructure to improve reliability, flexibility, and security."

Regarding #2, though, since my past experience includes building public clouds, my perspective does not limit "DevOps" to only utilizing public clouds. You can automate the build-out of physical hardware too. It's not really possible to automate rack n stack, but you can abstract that away through external logistics vendors that pre-rack/cable gear for you at a certain scalability point.

Things like OpenStack Ironic, Dell Crowbar, Cobbler, Foreman, etc. are definitely DevOps tools, yet they are specifically focused on handling automation of physical hardware deployments.

As a further example, many networking vendors now provide APIs, but even when they didn't they had SSH interfaces. It was very possible to automate the deployment of large quantities of networking gear using remote-execution tools like Ansible or even just Ruby or BASH scripts. There's no need necessarily to have a dedicated networking person.
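To make that concrete, here's an illustrative sketch (hostnames, credentials and commands are all invented) of the plain-Python flavour of this: pushing one change to a batch of switches over SSH with paramiko instead of configuring each box by hand.

    # Illustrative only; a real tool would wait for prompts and handle errors.
    import time
    import paramiko

    SWITCHES = ["sw-rack1.example.net", "sw-rack2.example.net"]
    COMMANDS = ["configure terminal", "vlan 42", "name app-tier", "end", "write memory"]

    for host in SWITCHES:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # fine for a sketch, not production
        client.connect(host, username="netops", password="secret")
        shell = client.invoke_shell()       # interactive session, like typing at the console
        for cmd in COMMANDS:
            shell.send(cmd + "\n")
            time.sleep(1)                   # crude pacing between commands
        print(host, shell.recv(65535).decode())   # dump whatever output accumulated
        client.close()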

Of course, as you scale up to a certain point in physical gear, it pays to have specialization. But that's true even in the cloud, where you may need to hire a specialist to deal with your databases, a specialist to deal with complexities of geographical scale/distributed systems, a specialist to deal with complex cloud networking (VPC et al). Just because something is abstracted away into a virtual space doesn't necessarily reduce its complexity or the base skillsets required to operate that infrastructure.


No. This is true for colocated, but absolutely false for dedicated.


You are correct! If you're just colocating, you need your own people to manage your gear. Dedicated equipment is managed by the service provider.

Disclaimer: Provided hosting services for ~8 years.


Companies that provide managed hosting for hardware tend be just as expensive as cloud providers.

These comparisons tend to be: oh look, you can buy the Dell server for cheap, shove it in a data centre and it's a lot cheaper.

If you start going the route where you ask the provider to do it, the charges tend to be a lot more. It's usually the sort of thing where you need to do a phone call before you can even get a quote.


Again, this is demonstrably false. What's commonly meant by "dedicated hosting" is essentially the same service level you get from EC2 - they take care of the network and hardware, you take care of the software (1).

The cost here is 2x-4x lower. In another comment, I quoted an E5-2620 v3 w/64GB for $400, which is exactly this type of setup. This is offered by a company that's been in business for longer than AWS has existed and who, in web hosting circles, is well respected. You could definitely go cheaper. You could definitely go to IBM and Rackspace and pay as much as EC2, sure... but there are literally thousands of providers in the US, which have been in business for over a decade, that'll beat EC2/Rackspace/Softlayer by a wide margin.

Some that I've personally used: WebNX (LA), NetDepot (Atlanta), ReliableSite (NY), HiVelocity (Florida). I've also done it on the cheap with OVH (both Quebec and France) and Hetzner (Germany), as well as on the expensive side with Softlayer (you can negotiate softlayer down considerably even on a small order).

(1) That's a disservice to dedicated hosting, because the quality of the network is often better, and you won't get termination emails from your hosting provider or noisy neighbours the way you do on EC2.


Will those guys provide you with a SQL Server cluster of at least 2 servers, set up AlwaysOn, set up failover clustering with quorums, optimal disk partition alignment, DTC, subnets, and ACLs to secure your cluster, plus lots of MSSQL-specific stuff I don't know about, and then monitor it for you? Because that's what a managed service is. And most ask a lot for this because of the specialists it requires.

If you pay for RDS, Amazon does this for you, and it will already be well set up below the application level. They have guys monitoring it and keeping it up. And they can do this cheaply because of the benefits of scale across all their customers.

If you just ask for a machine then yes, that's going to be cheap. But you're forgetting you need databases, application delivery controllers, firewalls, VPN appliances - all of which may require niche vendor knowledge to set up, and then they would charge a lot in consultancy fees to design a solution for you. Amazon puts this stuff behind simple APIs where you don't need to know all those vendor-specific skills.


I don't mean to be rude, but Github, Stackoverflow, and Wikipedia all run their own physical environments. It's easy to demonstrate that the cost savings you indicate in AWS don't exist.

AWS helps you prototype and iterate faster. It is not cheaper.


And I know lots of companies that host in the cloud and made a cost saving after moving away from managed hosting, because the managed hosts were charging so much to look after it.

Github, Stackoverflow, and Wikipedia probably have very good dedicated specialists whose salaries only make sense when you operate at their scale.


> And I know lots of companies that host in the cloud and made a cost saving after moving away from managed hosting.

Can you provide a citation? Because AWS tools _are managed hosting_. They're just managed hosting without support (unless you're paying AWS for it on top of the service cost).


Well, they tend to be small/medium-sized companies, because they don't yet have serious scale to hire specialists, but at the same time require reliable HA hosting. It can't just be some databases or web servers set up in an ad hoc fashion.

A lot of these articles that compare cloud vs dedicated only compare a web server in a data centre.

They don't consider that building a complete platform is a lot of work. You need to set up databases in a highly available, performant fashion, plus backup solutions, off-premise backups, ADCs (NetScalers for example), firewalls, subnets, storage (SANs), ACLs, site-to-site VPNs (or MPLS), etc., which requires some knowledgeable people to set up. Companies that set this up for you rightly charge quite a bit for it.


This is almost a standard product with a lot of providers:

http://www.postgresql.org/support/professional_hosting/north...

> If you just ask for a machine then yes, that's going to be cheap. But you're forgetting you need databases, application delivery controllers, firewalls, VPN appliances. All of ...

What do VPN appliances have to do with database hosting? I'm guessing you're an enterprise Java developer?


Notice how all of those companies present themselves as consultancies.

The rest just show virtual machine prices, dedicated machine prices etc. You have to dig around the websites to find anything about "complete" database solutions. They present you with a phone number and you will have to phone them up, and it won't be cheap. They will charge consultancy fees.

I'm not talking about database hosting. I'm talking about setting up a web hosting platform. A lot of companies would like secure access via VPN. Imagine you have some network appliance: you would probably want its admin console completely cut off (via ACL) from the public internet. You would access it via a site-to-site VPN instead.


Having worked at one of these for ~6 years... no, they don't (well, it is true that you can get extra services of course, and I bet there are some that do charge, but in general, they don't).

Having a hosted + backed-up database that's redundant (I hear master-slave with auto-promotion is quite common; MySQL and PostgreSQL are available) and that THEY will fix when it becomes unreachable, or a cable is cut, or... that's a standard product, and there are no fees for fixing it when it goes down.

All I know is that the most expensive dedicated customer we had (which we did a database for) paid less than $20k, and he had about 800,000 DAUs (and is a well known site).

Also, in all the time I've been in the industry, I've known a single instance of hardware failure leading to data loss. I've heard of dozens of times that perfectly authorized users accidentally erased/corrupted the database (mostly because they wanted me to figure out how to repair it). This is the real threat, and it will require consultancy to fix, given the average level of technical competency of cloud customers, both on cloud and dedicated.

Another one was a dating site that had less than half that in charges (no idea about DAUs), moved to Amazon, balked at a bill that (after serious optimization and large customer discounts) exceeded $100k, and mostly moved back to us (only using Amazon as a backup location).

Cloud has a number of advantages, but price is definitely not one of them. And there's serious "fine print", like that most of the cloud advantages (like "not losing data when a physical machine dies") don't apply to small customers.

The big secret is that scaling customer apps generally runs into issues because the customer's programmers are not considering efficiency at all. Letting me loose on their code results in a 20-30 TIMES improvement within hours, not because I'm that good, but because usually the first factor of 10 is simply not fetching the same data 10 times from the DB. There's no real difference between languages in the programmers' skill level (with exceptions for rare languages: Haskell programmers, for instance, are definitely better, but that will stop if it ever becomes popular). There are exceptions, of course, but this is the common situation.

Having done 2 cloud migrations in the last year I worked at one of these managed hosting shops, scaling doesn't work any better on cloud; in fact it often works far worse, because PHP's performance far exceeds what you get with Ruby or Python/Django. When you exceed shared serving you can move up to managed/dedicated for a big performance boost (in fact the hoster will probably do this for free when you upgrade your plan), which is likely to last you through more growth. The issue is that PHP shared hosting in my experience often serves more customers than 4-5 dedicated Django or RoR servers can (mostly because, like all ORMs, Django's ORM means 40-50 database calls per page shown; RoR is no different).

And unlike cloud, serving pictures or even video will result in your site slowing down, not in a $10k bill.
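To illustrate the "same data fetched N times" point with a toy example (SQLite and an invented schema, just so it runs standalone): the query-per-row pattern versus one round trip, which is roughly what select_related/prefetch_related give you in Django.

    # Toy demo of the N+1 query pattern vs a single join.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts(id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
        INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO posts VALUES (1, 1, 'hello'), (2, 1, 'again'), (3, 2, 'hi');
    """)

    # N+1 style: one query for the posts, then one query per post for its author.
    for _, user_id, title in db.execute("SELECT id, user_id, title FROM posts").fetchall():
        author = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()[0]
        print(title, "by", author)

    # Same page, one round trip: a join (or, in an ORM, select_related/prefetch_related).
    one_query = "SELECT posts.title, users.name FROM posts JOIN users ON users.id = posts.user_id"
    for title, author in db.execute(one_query).fetchall():
        print(title, "by", author)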

That said, there are things the cloud is good for. We fail badly for customers that require truly large amounts of disk space to be simply accessible (we're talking 10+ TB or so that's constantly accessed before this becomes an issue), or if they need massive amounts of compute that has to scale with very little warning. Even in those cases, the times I've seen it they quickly decided that storing some files in the cloud through remote filesystems is far preferable to actually running the site on the cloud, due to cost. AWS Lambda is pretty useful for this: quickly run some code on input, store the result on S3, don't keep the remote task running. Uploading Java code is easy, so if the task is processing/tagging/creating reports/PDFs/... it's easy to code up, test locally, and packaged as a jar it will work.


Ok, can you give me a link to a clear pricing structure that offers this?

We use C#, .Net platform, and write sql directly, so language performance isn't really an issue for us.

I've had disks blow up on me before, but didn't lose any data because of HA.

I've also had power supply issues before. If we had been running bare metal, the automatic VM migration could not have worked.


I would have agreed about specialization in older versions of MSSQL but they specifically made the HA sysadmin experience with AlwaysOn a cakewalk. I think any sysadmin with a decent amount of general experience could get it all figured out within 1-2 normal workdays (including your average interruptions).


You don't need any of this stuff; there are providers out there that will give you all of it for less than Amazon charges.

The problem is nobody wants to do it because it's not sexy, or name-brand / won't-get-fired-for-hiring-IBM.


Well, you do want it if you're running a $30-million-a-year website off it. A little downtime can cost a lot.


Talk about missing the point.

It's hard to sign up with a provider nobody has ever heard of. Doesn't matter if you're making 30 mil or 3 thousand dollars.


B2 kills S3.


Maybe once it leaves beta.


All valid points.

> Being on AWS does not necessarily mean that you don't need a devops person - especially at the scale where not being on AWS actually makes a difference to your margins.

Agreed. I didn't mean to imply that. On our scale it means we need 2 instead of 4. And for consulting business, 1 person can handle multiple smaller clients so each client doesn't need to take on the full cost.

Incidentally, since moving some of our self-managed servers to AWS services we have seen a drastic decrease in how often our on-call engineer is woken up at 3 AM. Which makes them happy. And happy employees are always a good investment. Admittedly, given enough time we could have made our stuff as resilient as AWS.

Also, being able to call Amazon support and get a second eye on things helps. Albeit, the support plan is a bit pricy. In our case $10k a year.

> I have seen quite a few people move off AWS successfully onto bare-metal leased hardware. S3 is just about the only service that's difficult to find an alternative for. Personally, I find using something like DynamoDB no different than using an Oracle DB - it's a vendor lock-in. Unless you had enterprise-level support on AWS (which costs a lot) - if you run into issues with Amazon's proprietary services, then good luck to you.

Vendor lock-in is a major issue. Which is why we used self-managed databases on AWS for years (vs RDS, Redshift, or DynamoDB). Our conclusion -- after a few years -- was that in our case (YMMV) we could accept the vendor lock-in but make sure our code was abstracted in a way that makes moving easier.

Plus, with AWS's new Database Migration Service, moving databases off (or onto) AWS is pretty easy.

Also, there are actually some really good S3 alternatives now. Their names escape me. I'll look them up later. However, I've seen many companies use only S3, Glacier, and CloudFront but not EC2. Your servers don't need to be on AWS to use them, obviously.

Almost all AWS services have good open source alternatives. And we have spent the time to make sure our system is architected and our code is written in a way that has a clear path to switching. Microservices really help here.

> AWS is great to get started, but once you know that you're going to need scale (and lots of infra), it's best to move. I say all of this as someone who has extensive operating experience on AWS. YMMV.

Maybe. But once you buy reserved instances in AWS the cost is pretty low. And when hardware fails in AWS it doesn't cost you more (assuming you're architected without single points of failure). I've found a lot of people did the math before AWS lowered their costs and introduced reserved instances. Either way, this is why I said "cheaper in some ways and more expensive in others".


As for performance numbers, you'll get better performance from a $400/m dual E5-2620 v3 w/64GB of RAM than a $1200/m c4.8xlarge.

The E5-2620 will also include a lot of bandwidth (which alone could save you thousands of dollars a month), and significantly better I/O (>512GB SSD + 1 larger spinning, or maybe RAID with BBU).

The gap is probably worse right now as v5 chips are hitting the market.

Even at 100 servers, the price/performance difference doesn't cover the cost of 1 devops person. You're right. But I don't think AWS saves even a little devops time, unless you deeply lock-in.

I hope when you're considering the price, you are also factoring in time developers are spending on performance and architecture for AWS versus simply scaling up. Even if, as you say, AWS saves you 2 out of 4 devops roles, if it's costing 50 developers 10% of their time, you're way behind.


I guess in my particular case I don't need a dual E5-2620 v3 w/64GB of RAM.

The only way I could see needing that for our product is if we were virtualizing our own infrastructure.

But, AWS is not a 1 size fits all case. In that case, AWS probably doesn't make sense.

Side notes: you also need to consider the electricity, setup time, cost of hardware failure, networking access, rack space, etc. The hardware itself is not the most expensive part.

Side note two: where are you getting those prices? I've seen just the motherboard and 1u case cost close to that much. I'm assuming those are workstation prices not rack mount... but if they are rack mount please give me your supplier :)


Pick a smaller ec2 instance, and I'll find a comparable dedicated server that yields a similar price/perf gain (true, it might be more pronounced on the high end...).

Your two side notes tell me that you're conflating dedicated and colocated hosting. I can see now why you think EC2 saves you devops if you think colocated is your only choice.

I'm talking about dedicated hosting, so, no, I don't need to worry about electricity, setup time (not as much as you mean anyways), hardware failure, network access or rackspace. This is an extremely common model.

The price that I quoted I just got quickly from Hivelocity.com (1). The price is actually $300, not $400...I'm not affiliated with them at all. They've been in business for longer than AWS has existed. I could have gotten a similar price from a thousand companies.

(1) https://store.hivelocity.net/product/125/customize/1/


You are right, I was thinking colo more than dedicated.

Also, I'm not sure if everyone knows, you can get dedicated from AWS now. It is expensive, though.


> Relatively the same cost per month

For us, AWS is not even remotely close in cost to a rack, hardware, and staff. The estimated spend for us to duplicate our dedicated rack setup in AWS is 3 times our monthly operational costs. And this does not include using any additional services. And to be clear, this includes operational staff, spare hardware, disaster recovery, etc.


One thing I found surprising on reading this is that they essentially spent $10k and two months ($5k listed as saved * 2 months) to figure out they weren't using their logging infrastructure any more. Wish I could be that gung-ho with resources!


The title should really be "How we stopped wasting $132k a Year With an IT Infrastructure Audit." The article certainly leaves the impression that a lack of processes, procedures and basic change management left you in a position where you were wasting $132k a year. As I read it, the article shows some recognition of that fact, but a false dichotomy between innovation, time to market and change control is used to justify not properly addressing the issue. Thus it will most likely strike again, but in a more painful manner.


https://buffer.baremetrics.com/ Specifically https://buffer.baremetrics.com/stats/mrr#start_date=2015-10-...

Buffer has grown from $509K a month 12 months ago, to $663K a month 6 months ago, to $782K a month now. If being able to quickly iterate helped at all, that $132K is more than covered by the MRR growth from 12 months ago to today.


You're justifying wasting money because the company is making money? While I think the tone of the comment above is pretty inflammatory, the person is totally right. It was the first thing I thought of when I read the title.

If you're spending $132k/mo on hosting to make only $782k/mo, you're doing it wrong.


Despite the fact all your numbers are wrong, I don't get why people think saving money on servers is the most important goal.

Revenue - expenses = profitability. You can move either of those dials, and revenue is the bigger lever because it has no upper bound - expenses can never get to zero, but revenue can grow to any amount.

If a business is anywhere near breakeven, and can increase expenses by 10%, and revenue by 15%, it should do it.

Simple equation: grow by $300,000 PER MONTH or save $11K per month - which is better? I say chase the revenue growth over some minor cost cutting, but that is just me.

I dunno, but I reckon Buffer knows a little something about growing...


Where did $132K/mo figure come from? The only place I see that quoted is in an annual savings number.


You're right... they spend even more than $132k/month. They just trimmed the total amount. So they are even worse off. =(


Is there evidence for that in the article, or supposition?


I wonder where the tipping point is at which it's more economical to own/lease your infrastructure instead of AWS/Azure/etc.?

The company is certainly large enough to have their own infrastructure team.

But granted, migrating an entire system that makes extensive use of the AWS ecosystem is anything but trivial.


In terms of scale, for EC2 and all services that run on it, it's more economical at ALL points to rent (dedicated) or buy (colocate). S3 is more competitive, so that question becomes more interesting.

There might be specific use-cases/apps/business with high burstiness where the general case isn't true. There might also be teams that can't manage it.


I don't believe the "at ALL points" is true. What if I'm running a small system myself? My time is worth something to me and I'd rather not have to think about setting up my own backups and monitoring. On AWS I can do this with a few clicks. I can manage it, but I'd rather not.


The AWS services are a pretty good moat. Building and running your own S3 and EBS is not quick, cheap or easy.


Careful about ALL. What about something that fits in the free tier? Or the $1-a-month Lambda tier?


That's a minor exception not worth adding a footnote for.


So it's cool to just blindly say ALL when you mean "many"?


I could cop-out and say "but if it costs you 10x more after the free-tier is expired, is it really free?"

But I can admit that I forgot about the free tier. It's a good deal and a smart business move by Amazon. Sorry.


High sustained use with little variance in resource demand (easy for HPC, storage-centric services). No need for geo-distribution (Backblaze), or big enough that you'd need multiple data centers anyway (Dropbox). For specialized uses, big enough to design and manufacture specialized cases up through motherboards (Dropbox, Backblaze).


The grass is always greener; I know many companies moving their infrastructure to the cloud.


My calculations showed that as few as 10 instances was enough to make self-hosting cheaper and easier.
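The exact numbers obviously depend on your provider and how much ops time self-hosting eats, but the shape of the calculation is simple. All prices in this sketch are invented placeholders, not quotes:

    # Back-of-envelope break-even: cloud vs dedicated + the ops overhead of self-hosting.
    CLOUD_PER_INSTANCE = 150.0    # $/month, hypothetical on-demand instance
    DEDICATED_PER_SERVER = 80.0   # $/month, hypothetical leased dedicated box
    OPS_OVERHEAD = 600.0          # $/month of extra admin time self-hosting costs you

    def cheaper_to_self_host(n_instances):
        cloud = n_instances * CLOUD_PER_INSTANCE
        self_hosted = n_instances * DEDICATED_PER_SERVER + OPS_OVERHEAD
        return self_hosted < cloud

    for n in (5, 10, 20):
        print(n, "instances -> self-hosting cheaper?", cheaper_to_self_host(n))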


The money saved isn't nearly as important as the reduced complexity and exposure gained from a regular spring cleaning.


Am I the only one who initially read this as "How We Saved $132k a Year With an IT Infrastructure Adult"? Having an IT Infrastructure Adult is essential, after all.


The first thing the adult does is order an audit, so it amounts to the same thing.


>For the longest while we’ve used fluent.d[sic] to log events in our systems.

As a maintainer, glad to see Fluentd there =) If folks have questions on Fluentd, I'd be happy to answer them here.


I'll add my personal experience regarding RDS.

If you are using RDS with provisioned IOPS, you can reduce your bill dramatically by downgrading to General Purpose SSD. I know certain applications might really need the dedicated IOPS, but it's better to monitor your read/write rates and decide accordingly.
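A hedged sketch of the "monitor, then decide" step (the instance identifier and threshold are placeholders; assumes boto3 with credentials configured): check consumed IOPS in CloudWatch, and if the workload fits General Purpose SSD, switch the instance off provisioned IOPS.

    # Check two weeks of consumed IOPS, then drop io1 for gp2 if the numbers are low.
    from datetime import datetime, timedelta
    import boto3

    cw = boto3.client("cloudwatch")
    rds = boto3.client("rds")

    def avg_iops(db_id, metric):    # metric is "ReadIOPS" or "WriteIOPS"
        stats = cw.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=metric,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_id}],
            StartTime=datetime.utcnow() - timedelta(days=14),
            EndTime=datetime.utcnow(),
            Period=3600,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        return sum(p["Average"] for p in points) / max(len(points), 1)

    db_id = "prod-db"               # hypothetical instance name
    total = avg_iops(db_id, "ReadIOPS") + avg_iops(db_id, "WriteIOPS")
    print("average IOPS over two weeks:", total)

    if total < 1000:                # illustrative threshold; gp2's baseline depends on volume size
        rds.modify_db_instance(
            DBInstanceIdentifier=db_id,
            StorageType="gp2",                  # General Purpose SSD instead of provisioned IOPS (io1)
            ApplyImmediately=False,             # apply during the next maintenance window
        )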


So they put a team of engineers on a problem and managed to save an amount roughly equivalent to the salary of one of those engineers? Also, don't AWS resources generally get cheaper with time?


nstart already pointed out that it didn't take long to do it, but... I've seen this attitude over and over again - "expenses are cheap, engineers are expensive, money will just keep flowing", etc.

We don't all live in worlds of unlimited budgets. For them to have taken 2 weeks (maybe $5k of effort) to save $130k plus is phenomenal (and yet, also just mundane). This means more money can be spent on hiring someone else, or higher profit sharing for all, vs just continually and marginally increasing someone else's bottom line.

As buffer grows, their needs will grow and expenses will grow. I hate to push out predictions too far in the future, but this $5k of effort is probably going to save them $500k over the next 3-5 years.

What's a bit irritating in all this is that the folks behind it probably won't be rewarded correspondingly (although buffer seems a bit more open and egalitarian about these sorts of things).


I'm glad you called out the cumulative savings benefit.

I'll add that they also gain a benefit multiplier in the form of the knowledge they gained from the exercise. Something that will hopefully be carried through into their future infrastructure decision making processes.


I was going to mention that too - the benefit to future projects, budgets and employers is potentially quite large.


Hey there! One of the engineers from Buffer here :). Good spot on the numbers being roughly equivalent to the salary of one of the engineers. Thanks for bringing that up. This was something we were mindful about, and so the work we did took us roughly 12-15 man days (3 of us working on it in fits and starts * some rough calculations :D ). Just as important, though, is that we took away a lot of lessons from this that will hopefully save us even more in the future, which is one of the major goals here :)


Assuming a simplistic $132k to save $132k annually, a 2 year net ROI of....$132k isn't so shabby :)


The amount saved affords an extra C-level hire. What intrigues me, though, is what required them to jump from 25 to 80 people.


One thing that really stands out is they aren't running reserved instances. That must be incredibly expensive.


Any way this could be done with Microsoft Azure as well?


Well, obviously. The key point is to review what you have, get rid of what you don't need and negotiate better pricing for what you do need. If you aren't doing this at least annually anyway, you are leaving money on the table. The platform doesn't matter.


I was able to save a small, niche web hosting company with its own proprietary CRM around $112k annually. I also greatly improved security in their shared hosting setup and automated code deployments and new customer onboarding.

I was able to do this by using Edgecast CDN as a caching proxy for all anonymous traffic. This reduced the load on their servers greatly and we were able to decimate the number of servers required. Rackspace servers are incredibly expensive and this represented a big savings.

We could have cached the pages in other ways, but this had the added benefit of serving anonymous requests from edge nodes and this reduced page load time by a great bit.

It was a big migration with sometimes maddening constraints imposed by business necessities and technical debt, but in the end we were able to eliminate a good bit of that debt.

The most frustrating part of the process was having to deal with sales reps that kept trying to push "cloud" solutions as a panacea for all scaling challenges.

The bandwidth and storage costs from using something like S3 would have been atrocious. The rackspace "cloud" solutions all would have had unacceptable latency problems.

And it would have required impossible code rewrites. One of the requirements for the project that was incredibly frustrating was that we could not force updates of the PHP CRM to any particular client. We offered an upgrade path, but we had dozens of different versions of the software running on the servers, along with Wordpress and other PHP/MySQL apps installed at customer request.

Shared PHP web hosting is one of the most difficult environments to work in. Each account was a petri dish of whatever customers uploaded via their docroot FTP access.

I pushed through a lot of changes to eliminate that practice and lock users down to FTP access for directories that would not execute PHP.

I also had the company move all Wordpress installs to Flywheel, to offload all the maintenance and security implications of Wordpress to a company that focuses on just that. This allowed the company to focus on its own CRM and nothing else.

All of it came at a really key time because competitive pressure from squarespace forced the company to drastically reduce prices.

When I pitched the original idea for the project, the entire internal team didn't tell me about the security issues, the multitude of versions running, the full FTP docroot access or even the existence of Wordpress on the servers.

When I discovered how FUBAR the entire setup was, as a contractor it would have been easy for me to bail, and probably personally healthier for my stress levels (I took off three months after it was all over to relax), but I stuck with them and brought them through the entire process to a successful conclusion.

I'm pretty proud of that project. I pulled it all off under some of the most difficult and irrational conditions one could imagine.

And the ROI for the company was insanely great.



