> If you are using a compute optimized AWS c4.4xlarge instance: you will be able...

ChuckMcM · on Feb 17, 2016

Having done this computation a number of times, with these machines it would flip over to 'colo' at about 150 machines. The really interesting thing is that amazon costs scale linearly with size while owned infrastructure scale fractionally. So the advantage just keeps growing and growing.

Bottom line, if you can host your app and resources on less than a 150 machines in Amazon you win, more than that and you're leaving money on the table. Once you get to the point where you are deploying new datacenters to support your customers you get a huge boost in operational efficiency.

kayoone · on Feb 17, 2016

Still not convinced about this. I can get a pretty beefy server from Hetzner for 60EUR per month. Skylake 6700, 64 GB RAM, 500 GB Raid1 SSD. One time setup fee of about EUR 100 but can be canceled on months end. On AWS an equally specced machine would run >300 USD per months easily. So it really only makes sense if you can scale your usage up and down a lot during the day. Does not make sense for most of smaller projects. Most of the companies i worked with had an EC2 setup of 3-10 instances that were mostly always up, which is way more expensive than doing the same thing on Hetzner. Basically they just did it because of free AWS credits for startups.

ChuckMcM · on Feb 17, 2016

Slight digression here but "RAID 1 SSD" might not be a good idea, especially if you bought two SSDs at the same time. They are probably from the same lot. My experience with 3,000 SSDs in use is that they seem to fail with a combination of age and rewrites. So I would be concerned that two equivalently aged SSDs with an equivalent set of rewrites (as RAID 1 does) would be quite likely to fail at the same time, or more precisely, when one failed you probably have a statistically harder time recovering the mirror from the other.

That is just statistics of course, but one of the things I tried to do in the Blekko infra-structure was to mix the ages of the SSDs to mitigate this risk.

rbanffy · on Feb 17, 2016

This is good advice even for spinning disks. I once met an unlucky RAID-5 array of three identical disks from the same brand, model, batch and with close serial numbers. When one of them failed, I immediately ordered 3 replacements.

They didn't arrive on time.

Good thing I had backups.

MichaelRenor · on Feb 19, 2016

I've heard there's strong correlation in failures by lot number. Can you confirm this?

What's the practical way to avoid this? Staggering your SSD purchases?

hhw · on Feb 17, 2016

This is a false dichotomy. There's no need to flip to colo, when dedicated hosting exists. The provider takes care of the network, hardware, and potentially much more depending on the service agreement. You can even have the provider setup a cloud environment for you, if you want to be able to spin up and down VMs. This eliminates the capex costs, while having much lower opex costs than colo at smaller scale.

Considering that you can easily do dedicated for 1/7th the cost of Amazon for the equivalent amount of resources, it doesn't take anywhere near 150 machines to realize cost savings, it's more on the order of one complete physical machine as a dedicated server. You can purchase a 2nd one for redundancy and/or for always available additional capacity, so that you can burst to double your usual peak performance and still save 5/7th the costs of Amazon. If you're expecting to require the ability to scale much higher than 2x, you can also purchase a bunch of servers just for 1 month, or even less time depending on the provider, or take a hybrid approach if you really need an extreme amount of dynamic scalability. The reality though is that such extreme scaling is in the minority of cases.

I'm not sure how you arrived at the 150 machines number, but that sounds like 3-4 full cabinets and about right to have a reasonable scale for colo.

thezilch · on Feb 17, 2016

Note: I manage dedicated-server hosting and all web services on Titanfall @ Respawn.

Surely your calculation doesn't / can't include bandwidth charges? And if you have Windows hosts... cloud screws you too.

You're not wrong that it takes a lot more machines than most shops need, and if you can make your scale elastic, 150 machines probably goes a long way! And to your point, developers undervalue their time a lot, and dealing with colocation and sourcing of hardware and backups can be a real drain to save some on CapEx.

ChuckMcM · on Feb 17, 2016

Our transit costs out of our Colo in Santa Clara are $6K/month for two dedicated gigabit lines. That translates roughly to 518 TB of data transfer a month if we could keep the pipes fully lit 24/7. Cogent has offered to make one of our gigabit lines burstable to 10G if we would like, same cost if we don't exceed a monthly average use of 1000Mbps.

And we use Linux so no Windows hosts charges.

There are more and more tools that are force multipliers for your Site Reliability Engineer (SRE) equivalents. You can't avoid swapping out dead drives, but with the right systems architecture you can make it pretty painless.

rdl · on Feb 17, 2016

Wow, you're getting ripped off.

cbg0 · on Feb 17, 2016

Indeed, I've seen Cogent go for $800 - $1000 for a 1GE over 10GE commit.

hhw · on Feb 17, 2016

It's possible to only pay the lower end of that range for a 2Gb 90th percentile commit on a 10Gb connection out of most major markets. Cogent pricing fluctuates wildly depending on your ability to negotiate.

Ignore the whole song and dance when they have to bring in a VP into the call to specially approve the pricing they're trying to pitch to you at. Those are standard transit sales tactics. Or avoid Cogent's sales entirely by working with a reseller.

MichaelRenor · on Feb 17, 2016

Speaking with the VP to Negotiate for transit prices? Sounds like a pita. AWS knows that you're an engineer who just wants to write code and has made that as frictionless as possible.

This goes back to my first point in that going into the DC requires staff with very different skill sets. It's not that it's not worth it, but there's a lot of costs involved part from the per-hour instance cost.

NotSammyHagar · on Feb 17, 2016

Can you elaborate? Discussions with details about things like this are always interesting.

rdl · on Feb 17, 2016

He's paying 2-3x market (although 2Gbps isn't huge). Cogent is pretty much bargain-basement, too.

I wish the industry would stop quoting prices they way we do (USD/Mbps, so e.g. $1.20/Mbps); Gbps is probably the right unit now, and it may make more sense to invert.

But there are lots of terms involved in a transit contract; unless you're buying tens of gigs, crossconnect, port cost, commit, etc. may be more meaningful than per-Mbps cost. And pricing often depends greatly on exactly where in a city you're buying (at IX, carrier neutral facility, etc. will be cheapest; off net building or monopoly building will be most expensive) -- and of course Cogent transit vs., say, Level(3) transit are only superficially the same.

jpgvm · on Feb 17, 2016

Exactly, transit cost is the big killer.

AWS/GCE/Azure are all really expensive compared to even good multi-homed transit. If you push more than 10TB and you don't need multi-homing you are already ahead.

If you do need HA multi-homed transit you are looking more at 50-100TB or so but still, there are a lot of places that easily do that every month, especially if you are doing cross-DC storage replication for example.

abakker · on Feb 17, 2016

Yes, but a huge reason to go with Amazon is to be able to take money back off the table. Sure, if you have predictably 24/7 load, then you may be better off racking your own at 150 machines. BUT, if you're not sure how much you'll need, and your app is built to scale up and down with demand, then that 150 number starts to slide upwards.

Risk mitigation is part of the cost of infrastructure that needs to be factored in, and if your business has a downturn, or your compute needs turn out to be lower, then you can scale down your OPEX pretty quickly. If you had bought them, you'd be stuck with them. (See Zynga)

The operating cost of doing it yourself is always hard to pin down, and include many many things. OS/Hypervisor licenses, staff, hardware, cooling, redundancy, disaster recovery, power, or Colo fees etc. The advantage with AWS, is that this just a few numbers on a few various services and it is very predictable.

boomzilla · on Feb 17, 2016

Yes, I guess the 150 machine is more like a guideline. It depends very much on your workload. If you need 150 beefy machine running at 90%CPU all the time, then yes, running your colo may come out ahead. On the other hand, if you run a system with an online fleet of say 20 web servers, a couple of beefy databases, and some offline analytic workload, then AWS will definitely come out ahead.

And that is not considering other AWS offering. I've come to really like DynamoDB. It takes some time to get used to, but I've found it's solving more and more problems for me, without having to scale OPS engineers. There is a danger of lock-in, but I guess as I am a paying customer, it's not going away anytime soon, and Amazon is probably not jacking up the price either.

ciupicri · on Feb 17, 2016

Yet Netflix migrated to Amazon and I think that it has way more than 150 machines.

ChuckMcM · on Feb 17, 2016

Yes they did. And that is a really great exercise in thinking about what their profitability would be like if they were structured more like YouTube in terms of their server infrastructure.

But to the point of many other folks here, if you're all cloud you can really shrink fast if you need too. Lose a millions customers? No problem, just drop some machines. Any business that can price into the monthly customer fee the marginal monthly cost of AWS to support that customer, can keep their costs flat when their customer counts vary. Someone with dedicated hardware is still paying the bills for it, even when their customers leave.

boomzilla · on Feb 17, 2016

I am not sure I'd agree that Netflix could've been more profitable if they went with their own DC. I think Netflix does not utilize that much hardware. Their streaming is on their own CDN, which I heard that are placed directly in the big cable companies' DC. Tracking and storing user profiles and other meta data would not require a huge amount of hardware. I think Netflix also does quite a bit of video encoding and analytics, but these workloads would fit well with EC2.

ChuckMcM · on Feb 17, 2016

That would be the thing right? Clearly they can compute their marginal cost per subscriber, and they know what the lifetime value of a subscriber is, so they budget what to spend on EC2 infrastructure and movies at Sundance and still meet their net income goals. They would also know how much difference it would make if they needed to add or reduce their EC2 footprint.

Sadly I don't expect them to share what those costs are. I have always felt that Amazon could, in theory, always out compete with NetFlix with their Amazon Prime Video because they would only have to charge the marginal rate for the hardware. However to date it seems like the content contracts have kept NetFlix on top.

rhizome · on Feb 17, 2016

I have to think they're getting a deal.

vosper · on Feb 17, 2016

Plus, they're the number one poster-child for AWS. Netflix running on AWS makes AWS safe for CTOs everywhere. So, yeah, they're definitely getting a deal.

kevcampb · on Feb 17, 2016

Which should be expected when buying in bulk.

BorisMelnik · on Feb 17, 2016

that would be fun to calculate

Retric · on Feb 17, 2016

Bulk rates are always going to be unlisted and cheaper.

Also, having 150 machines in a single location is rarely optimal.

Sanddancer · on Feb 17, 2016

One can always do a hybrid approach, where baseline load is handled on purchased machines, but with renting AWS time on an as-needed basis. Also, ops staff is useful, and often has skillsets that augment development. Letting someone else worry about building daemons, tuning parameters, etc, means that your devs can spend more time actually writing code. Interrupting your devs because nginx has a memory leak can be a much bigger waste of time.

seanwilson · on Feb 17, 2016

> I hear this type of argument a lot, but it's so important to include the cost in engineering and talent necessary to keep your datacenter humming.

I'm surprised how often I see arguments like this as well. If it takes even a few days of time for your employees to manage the hardware and you're only saving a few $100 a year buying the hardware yourself, you're already making a loss.

kayoone · on Feb 17, 2016

Imo it is not that easy. Amazon AWS does not manage itself, so you most probably need a Dev Ops guy anyway to not waste the time of your developers with configuring environments and keeping them running. I would say that the hardware part of devops is way less when you rent boxes and basically run them stateless.

seanwilson · on Feb 17, 2016

I'm not saying it's easy but it's easier. Different cloud services let someone else take care of replacing faulty hardware, upgrading hardware, adding new servers, software updates, backups etc. It's very difficult and time consuming to replicate all this reliably with a couple of dev ops guys.

kayoone · on Feb 18, 2016

If you rent dedicated boxes you don't have to worry about replacing hardware or adding new servers. Also the cost savings are so huge, as someone else stated a dedicated box could be 1/7 the monthly price of an equally specced AWS instance you can easily have a 1-2 devops guys on call and still save a lot of money. I don't see where buying/colo makes sense though, maybe if you have a really huge operation going on. Hetzner offers a i7 Skylake QuadCore 6700, 64GB Ram, 512GB SSD Raid1 for 60 EUR/month, an equally specced AWS instance (or multiple smaller ones) would easily be >400 EUR even when using upfront payment and reserved instances. That difference is huge if we are talking about 10-50 of these machines. The only benefit i see is when you can scale your app up and down a lot during the day to save costs, but i'd still like to see a real world case where that actually saves money. I am pretty sure most companies use AWS for the convienence and just don't care about the cost because it's minor in the grand scheme of things, like super popular startups with huge funding and years to figure out and optimize profitability.

voltagex_ · on Feb 17, 2016

I wish I had an excuse to play with one of these - I wonder what their colo costs are. I suspect you could run quite a few build boxes for all sorts of FOSS projects with a system like this.

jjn2009 · on Feb 17, 2016

Its unfortunate they didn't give an more details on the per year costs in colo. I'm pretty interested in that metric as well and how it compares to the other intel xeon chips.

zaroth · on Feb 17, 2016

Colo cost varies based on the facility, uplink, power density, and various other factors. So it doesn't quite make sense to talk about the "colo cost" of a specific device, or comparing colo cost of one chip versus another... unless what you're really just comparing is power efficiency.

Let's say you're renting half a rack with 100Mbps unmetered bandwidth and 20amps of 120V for $500 per month. So you have 16 amps / 1920 watts of continuous power you can draw. These Xeon D's are very lower power, although strangely Pat doesn't report the actual idle and full load draw as measured by a Kill-o-Watt meter. With the 10GBE I can't imagine budgeting much less than 150 watts per machine. If you can put let's say a dozen of these in that half rack, in my contrived example, the "colo cost" would be $41.66 / month or just $2.60 / core.

Patrick-STH · on Feb 18, 2016

Hi zaroth - We do not use Kill-o-Watt meters since we use calibrated Extech TrueRMS units for our "desktop" testing.

We also have a few racks in Sunnyvale, California that we use APC metered by outlet PDUs for and that we actually calibrated by testing with the Extech units and found two of over a dozen that gave us consistent readings across PDUs and ports.

This unit is in a SC515 chassis with redundant PSUs. Even with 4x 7200 rpm Seagate 2.5" drives, a 64GB mSATA drive and a Samsung SM951 m.2 SSD it is still pulling 0.24A on 208V (so 0.48A on 208V). If you were to gain a bit of efficiency running a single PSU, you could easily fit these in 1A on 120V power envelope.

What we have seen several folks do with similar D-1540 machines is actually use 1U 1A/ 120V hosting (which is very inexpensive) to deliver cheap local PoPs for their applications. We have been strongly considering decommissioning our Las Vegas 1/2 cab DR site and moving to this sort of distributed model as there are a lot of benefits with this.

We published more complete benchmark results ( http://www.servethehome.com/intel-xeon-d-1587-benchmarks-16-... )of the D-1587 yesterday, expect more in terms of reviewing the overall platform next week. The Xeon D is very platform dependent for power and the particular unit we have does have the additional LSI/ Avago/ Broadcom SAS 2116 (16 port SAS2) HBA onboard.

zaroth · on Feb 18, 2016

Thanks Pat! Long-time reader of STH, I really find your articles incredibly useful and informative!

I assume you meant to write "(so .48A on 120V)". 50W idle with 4 spinning drives, an M.2, and the redundant PSU really is really very impressive. As you say, keeping full load under 120 watts to be able to fit in a 1U/1amp colo gives an incredible value.

The hardware's not even that expensive! And for $50 - $60 / month in colo cost [1], the reliability of systems now with SSDs, and the amazing orchestration tools that are available, the likes of AWS, DigitalOcean, etc. start to look really expensive.

[1] http://www.webhostingtalk.com/forumdisplay.php?f=131

Patrick-STH · on Feb 19, 2016

Awesome! Great to hear. You got me on the typo. Should have been 0.24A per PSU (we meter by outlet on multi-outlet systems). Still 1A on 120V in a 1U is very easy to achieve.

jjn2009 · on Feb 17, 2016

The Xeon D-1540 is running 73 watts under load according to anandtech http://www.anandtech.com/show/9185/intel-xeon-d-review-perfo...

Across all the SKUs in OP the TDP is 35-65 watts. 20 More watts for the new 1587 SKU compared to the benchmarked 1540 in the anandtech article. So a rough estimate would be maybe ~95 watts at under load? Assuming the board is the same and only the CPU has changed that should be a pretty good estimate.

You're right it mostly doesn't make sense to compare machines in that way due to all of the variables surrounding colo costs, max load wattage is the metric I should be interested in.

detaro · on Feb 17, 2016

They talk about their colo stuff quite a bit, with prices here and there inbetween:

http://www.servethehome.com/going-colo-series-part-1-decidin...

http://www.servethehome.com/colo-series-part-2-picking-coloc...

http://www.servethehome.com/colocation-years-learning-experi...