Feeding data to 1000 CPUs – comparison of S3, Google, Azure storage (zachbjornson.com)
216 points by ranrub on Jan 5, 2016 | 72 comments



Stock Ubuntu needs the SR-IOV driver to reach the actual bandwidth limit on EC2; it makes a big difference. We routinely get ~2 Gbps down from S3 with that setup (using the largest instance types).

edit: Gbps not GBps


That's true, although the latest stock Ubuntu HVM AMIs (14+, I believe) already have the SR-IOV driver and use it by default. Older AMIs need to have it installed and enabled. I believe enhanced networking is only available on HVM AMIs.


This problem definitely existed with the official 14.04 (HVM) AMI, though I haven't re-tested recently; they may have fixed it. It did have some kind of SR-IOV driver, but it was too old.


Good point for "enhanced networking" instances. I didn't see the OS specified in the article. Amazon Linux would have the SR-IOV driver by default. PV vs. HVM might also have an impact.


Per the comment here [1] and the linked twitter convo, I'll retest S3 with Amazon Linux soon. These tests used Ubuntu 14.04 on all providers, and did use HVM. My understanding is that this will possibly increase the network throughput of the VM, but the benchmarks stayed below the VM's capacity (which was the reason I included the charts of VM throughput).

[1] https://news.ycombinator.com/item?id=10846497


A couple of other points:

1. Enhanced Networking (SRIOV) only works in a VPC and not in EC2-Classic.

2. I think the 4x instances don't support 10Gb ethernet. If that is the case, it would also be instructive to test the 8x instances on S3.

For some very application-specific (Hadoop) tests of Enhanced Networking, please take a look at https://www.qubole.com/blog/product/hadoop-enhanced-networki...


If you are pulling large files from S3, we have found that downloads can be sped up by requesting multiple ranges simultaneously. It is easy to hit 5 Gb/s or 10 Gb/s on instances with the necessary bandwidth, whether accessing a single file or multiple files. We have not encountered a limit on S3 itself. YMMV.
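For anyone who wants to try this, here is a minimal sketch of the parallel-range idea in Python/boto3 with a thread pool. The bucket name, key, part size, and thread count are placeholders, not anything from this thread; treat it as an illustration of the technique, not a tuned implementation.

    # Rough sketch of parallel ranged GETs against a single S3 object.
    # Bucket, key, part size, and thread count are made up for illustration.
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    BUCKET = "example-bucket"      # hypothetical
    KEY = "big-dataset.bin"        # hypothetical
    PART_SIZE = 64 * 1024 * 1024   # 64 MiB per range request
    THREADS = 16

    s3 = boto3.client("s3")

    def fetch_range(byte_range):
        """Download one byte range of the object."""
        start, end = byte_range
        resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range="bytes={}-{}".format(start, end))
        return start, resp["Body"].read()

    def parallel_download():
        size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
        ranges = [(off, min(off + PART_SIZE, size) - 1)
                  for off in range(0, size, PART_SIZE)]
        buf = bytearray(size)
        with ThreadPoolExecutor(max_workers=THREADS) as pool:
            for start, chunk in pool.map(fetch_range, ranges):
                buf[start:start + len(chunk)] = chunk
        return bytes(buf)

The key point is that each range is an independent HTTP request, so the per-connection throughput ceiling no longer applies to the whole file.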


Excellent. https://github.com/rlmcpherson/s3gof3r is my tool of choice for "fast, parallelized, pipelined streaming access to Amazon S3."

If you want to saturate network bandwidth with S3, that's the one tool I know of that can do it.


AWS has a limit on the total throughput any one account can have to S3, so the more CPUs OP adds, the worse OP's performance will be on each one. I suspect the other providers have the same restriction.

I either missed it or OP didn't specify how many instances they were using at once to run the benchmark, but the more instances they used, the worse throughput will be per node.

This did not seem to be accounted for.

EDIT: OP says below it was from one instance, so what I said doesn't apply to this writeup.


This is not the case with Google Cloud Storage. I cannot speak to the other providers.

Google Cloud Storage does not limit read or write throughput with the exception of our "Nearline" product (and even Nearline's limiting can be suspended for additional cost, a feature called "On-Demand I/O").


That's good to know, and it adds credence to my opinion that networking is the area where Google is definitely winning the Cloud Wars(tm).


All the benchmarks were from a single instance.

(Note that I have done some testing from AWS Lambda, where we had 1k lambda jobs all pulling down files from S3 at once. That's a bit harder to benchmark...)


Hi OP, nice writeup! I hope my comment wasn't construed as dismissing the work, just a criticism of one small part.

It sounds like that wouldn't have been a factor, except for the cap you seem to have discovered on Amazon that you called out.

My only suggestion then is you may want to make it explicit that you ran the benchmarks from a single instance.


Thanks! Not at all, it's a great point and something I didn't realize would play into the equation.


Any comments on how it worked out with Lambda?


Reluctant to say much because the benchmarks weren't formal. However...

The throughput correlated directly with how much RAM we allocated to the Lambda function (which presumably means we were sharing the VM with fewer other jobs).

512 MB RAM, 19.5 MB/s

768 MB RAM, 29.8 MB/s

1024 MB RAM, 38.4 MB/s

1536 MB RAM, 43.7 MB/s

Note that this also used the node.js AWS SDK, which is slower to download files than some other APIs.
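For context, the shape of that benchmark is roughly "each Lambda invocation downloads one object and reports its throughput." A hypothetical sketch in Python/boto3 follows (the author's actual benchmark used the Node.js AWS SDK, and the event fields here are assumptions):

    # Hypothetical per-invocation S3 throughput measurement.
    # The author's real benchmark used the Node.js AWS SDK, not this code.
    import time

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        bucket = event["bucket"]   # assumed event shape
        key = event["key"]
        start = time.time()
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        elapsed = time.time() - start
        return {"bytes": len(body), "mb_per_s": len(body) / 1e6 / elapsed}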


Thanks. I'd guess that larger RAM allocations land on larger host instance types, hence more bandwidth. If this were my goal, I'd try gof3r to stream data from S3.


Do you have any sources or more information about the per-account S3 limits?


I don't have any published sources; it's something they told me, but it's hinted at here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-...

They explicitly mention the RPS per account limit in that doc, which is related.


RPS to S3 is limited, but throughput to S3 is not, except per bucket. Higher throughput can be achieved by sharding your data across multiple buckets. It's also important to properly namespace your keys within buckets to ensure data is efficiently distributed across the underlying partitions.
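For example, the commonly recommended pattern at the time was to prepend a short hash so that sequential key names don't all land on the same index partition. A hedged illustration (the prefix length and key names are made up, not prescribed by AWS):

    # Illustrative key-naming scheme: a short hash prefix spreads otherwise
    # sequential keys across S3's underlying index partitions.
    import hashlib

    def partitioned_key(original_key):
        prefix = hashlib.md5(original_key.encode()).hexdigest()[:4]
        return "{}/{}".format(prefix, original_key)

    # e.g. "logs/2016-01-05/0001.gz" -> "<4 hex chars>/logs/2016-01-05/0001.gz"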


Unless that is a semi-recent change, that is not what I've been explicitly told. To be fair my information is at least two years old now.


My experience is solely based on recent production workloads attempting to pull TBs of data out of S3 very quickly to restore data to a less-than-reliable indexed datastore. YMMV.


Can you quote the piece where they mention the per-account RPS limit? I cannot find it.


> However, if you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate.

You have to know how to read their docs. :) This is basically code for, "there is a default limit here that you have to get raised if you want to go above it".


The full quote is:

>Amazon S3 scales to support very high request rates. If your request rate grows steadily, Amazon S3 automatically partitions your buckets as needed to support higher request rates. However, if you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate. To open a support case, go to Contact Us.

So this looks like an auto-scaling issue. It states that S3 automatically partitions buckets as needed to support higher request rates. However, if we know that a bucket is going to need to scale dramatically, we can request, in advance, that the S3 team pre-scale it.

I'm sure there is an account limit, but running 1000 CPUs already requires requesting an increase in the account's EC2 instance limit. Are you saying that a team trying to access 150 GB of files, or to make 1000 RPS, as the article documents, will hit that limit? From your experience, how big is this hard limit? Is it Netflix scale, or is it GB or TB?


We routinely pull a dataset of hundreds of GBs to 100+ instances (1600+ cores) in parallel. We have never noticed throughput going down with the number of nodes. S3 delivers the maximum throughput of 2-4 Gbps per instance very consistently.


Take into account OP's former jobs. I imagine if anyone would run into such a limit, it would be Reddit or Netflix.


If such a limit exists, it would not have been hit on such a small benchmark. However, I am unaware of any such limit and it has never been raised in any discussion I have had with them. I am responsible for a large compute and data storage platform backed by S3.

Is this a limit that is hit anywhere near the 150GB discussed in this article, or is it something that you hit only if you are Netflix? We have TB in S3 and have not observed any limit other than EC2 instance bandwidth.


The amount of data one has in S3 isn't really relevant to the discussion, only how quickly you're trying to pull it into your instances.


Ok then let me rephrase: Is this a limit that is hit anywhere near the 603GB/s figure in this article, or is it something that you hit only if you are Netflix? You seem to be claiming that such a limit exists and that you know what it is. Can you share or is this NDA territory?


When I see things like "data set size 150GB" and "1000 CPUs" I just naturally assume they are all in memory and never come from disk :-)


That's one of many data sets on the server, so unfortunately we can't keep them all in memory at once. :(


Let's assume that when you say "CPU" you mean "core", and that your typical server-class machine has 24 of those. 1000 "CPUs" is then 41 machines; if they each donate 32 GB to the cause[1], that is 1.3 TB worth of data which is only a few microseconds away from any core.

I'm not sure why anyone would build a server with less than 96 GB on it these days, so it's not at all unreasonable. Your service provider may jerk you around, but you can run two racks of machines (48 machines) in a data center with specs like that for about $25K/month (including dual gigabit network pipes to your favorite IP transit provider). So it isn't even all that huge of an investment.

[1] Consider your typical 'memcached' type service where data is named as a function of IP and offset.
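A rough sketch of that footnote's idea, in Python: the location of a chunk is a pure function of (dataset, offset), so any core can compute which node holds which bytes without a directory service. The node list and chunk size are made up for illustration.

    # Chunk location as a deterministic function of (dataset, offset).
    # Node addresses and chunk size are hypothetical.
    import zlib

    NODES = ["10.0.0.{}".format(i) for i in range(1, 42)]  # 41 hypothetical machines
    CHUNK = 32 * 1024 * 1024                               # 32 MiB chunks

    def locate(dataset, offset):
        chunk_index = offset // CHUNK
        key = "{}:{}".format(dataset, chunk_index)
        node = NODES[zlib.crc32(key.encode()) % len(NODES)]
        return node, key   # fetch `key` from the memcached instance on `node`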


I think that data set is too small to constitute a good benchmark for the setup.


You're not wrong, but apparently such a short burst is what they're actually doing in their application.


With kernel tuning, S3 performance improves (and will probably improve on GC/Azure as well). Also, the author uses Ubuntu 14.04 (see https://twitter.com/Zbjorn/status/684492084422688768), which doesn't use AWS "Enhanced Networking" by default. It would be interesting to see results for tuned systems.


Very interesting comparison, glad to see it. I don't have a comment on the content itself but I do have a note on the presentation.

The colors used for S3 and Azure Storage in the graphs are very nearly indistinguishable to me, as I have moderate red-green colorblindness. It's easier to tell them apart on the bar graphs, since the patches of color are much larger, although I still have to work at it and use the hints from the labels; on the line graphs, it's basically impossible. A darker shade of green would solve the problem for me personally, but I'm not all that bad a case, nor an expert on the best shades to pick for general color-blindness accessibility.

Just something to think about when presenting data like this.


Color-blind here as well; I had to zoom in incredibly close to tell the difference.


Thanks for pointing this out, and my apologies! Will fix that going forward.


Has the author (if they are reading here) considered using Joyent's Manta to take the processing to the data instead?


There are plenty of architectures that do exactly this. EMR-on-S3, Google Dataproc on GCS, Snowflake-on-S3, BigQuery-on-GCS, etc etc.

The bigger point in the article is that these exact "take processing to the data" architectures operate exceedingly well on S3, GCS, and Azure.

And, as a biased observer, these architectures operate on GCS the best due to great performance measured in the article, quick VM standup times, low VM prices, and per-minute billing.


I'm still trying to parse the docs and Manta source code to see what it actually does, but it seems unique if the data storage nodes are also the data processing nodes and no data transfer happens from some storage service before the job begins. The other key factor is having neither startup time nor the cost of a perpetually running cluster. Per my comment below [1], we have used Lambda with S3 to get something like this, as well as our own architecture built on plain EC2/GCE nodes.

[1] https://news.ycombinator.com/item?id=10846514


Not only that, but the thing is built by people who really know what they are doing, like Bryan Cantrill and other former Sun top people.


got it. thanks!


Are you sure you understand what "take the processing to the data" means?

EMR-on-S3 is the "copy the data to the processing nodes" variety.


I think Manta is better if the result set is smaller than the input set, so network performance won't matter that much. Per-second pricing is also better, since the author needs the result in 10 seconds.

Spinning up a cluster of VMs, using it for 10 seconds, and being charged a minimum of one hour seems expensive to me.


I don't know about Manta, but this is the entire point of HDFS. It's easier to move code than data.


Indeed, but they're having such fun. Let's leave them be.


Hadn't heard of it, looks cool. Thanks for the tip :)


In S3 tests on c3.8xlarge instances, I've seen 8 Gbps throughput on both uploads and downloads using parallelized requests. Testing with iperf between two of the same instances maxed out at about 8 Gbps as well, so the throughput limitation is likely EC2 networking rather than S3.

These tests were done over a year ago so bandwidth limitations on EC2 may have changed since.

This testing was with https://github.com/rlmcpherson/s3gof3r


That's really cool. Wonder if the same technique (parallel streams) would help for Azure and GCS. I know GCS has some built-in capabilities for composite uploads/downloads, which might achieve a similar effect.
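If anyone wants to experiment with that, GCS composite objects roughly work like this: upload parts as separate objects (possibly in parallel), then stitch them together server-side. A sketch with the google-cloud-storage Python client; the bucket and object names are placeholders and this is only an illustration of the compose call, not a benchmarked uploader.

    # Sketch of GCS composite objects: upload chunks as separate objects,
    # then concatenate them server-side with compose(). Names are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("example-bucket")

    parts = []
    for i, chunk_path in enumerate(["part0.bin", "part1.bin"]):
        blob = bucket.blob("upload/part-{}".format(i))
        blob.upload_from_filename(chunk_path)
        parts.append(blob)

    final = bucket.blob("upload/final.bin")
    final.compose(parts)  # server-side concatenation, no re-upload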


Thanks for sharing your research - I've been up to my neck in EC2 migrations and trying to benchmark as I go... S3 is the next chunk of work. Rock on!


What's missing from the description is the network setup. Is it EC2-Classic or a VPC? Is EC2 getting to S3 through an internet gateway? Hopefully not through NAT. There is also the VPC endpoint for S3. These may all have different performance profiles, especially with multiple instances.


Network was VPC. The EC2 instance had an IG attached, yes, but I'm not sure if you're asking if an internal vs. external URL for S3 was used? Are you saying there's a better endpoint than s3-<region>.amazonaws.com for S3 requests from EC2?


I meant http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-en...

It's a private connection to AWS services, including S3. You'd use the same URL, as it's basically a routing change. No idea if VPC endpoints would be better than the internet gateway, though. P.S. Just tested, and I get about half the latency with a VPC endpoint.
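For reference, creating one is a single API call. A sketch with boto3; the VPC ID, region, and route table ID are placeholders:

    # Sketch: create a gateway VPC endpoint for S3 so S3 traffic stays on
    # AWS's network instead of going out through the internet gateway.
    # IDs and region are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],
    )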


Neat, did not know about that. Will add it to my follow-up benchmarks. Thanks for all the comments. :)


I'd be interested to see how AWS' Elastic File System (EFS) compares (though I'd imagine it's not great, given it's mounted via NFS)


No hard numbers for you, but FWIW I ran tests about 4 months ago and the performance was /very/ low compared to what is achievable with S3 and even a normal NAS.


I've been on the list to get into their preview program for a while so I can benchmark it, actually! Part 3 of the blog post is going to include some NFS stuff either way.


When you do, it would be really useful to include the classic fio/bonnie/etc. stuff to break down performance by the type of operation (e.g. file creation / deletion, streaming read/write, random read/write) and block size.

EFS supports NFSv4 so it should avoid being as routinely limited by server round-trip latency as NFSv3 tends to be but it'd be nice to see how well that works in practice.


How reliable is Azure? For example the story of Gitlab on Azure was a disaster: https://news.ycombinator.com/item?id=10781263 Something like that wouldn't happen on AWS, GC, Softlayer, etc.


WTF would one deploy such a thing in the cloud?


Because renting 1000 cores for a limited time is much cheaper than buying them outright?


1000 cores of what? A vCore is marketing BS. Even if it were not marketing BS, it's 28 2U 3-node boxes (if using older CPUs) or 14 2U 3-node boxes (if using more recent ones). Unless they have an extremely spiky workload, using AWS is pointless. Bandwidth-bound scientific apps ==> use an InfiniBand cluster.


The OP is talking about running $0.027 worth of computation (1000 cores for 10s at 0.01/core/hr) and you think he should spend tens of thousands on hardware?

I'm not doubting a custom build will give him much greater bandwidth. I just doubt the workload has to be "extremely" spiky to make the cloud cost-effective.

Of course, he's going to get billed for 10m or 1hr minimum (Google or Amazon), so that's assuming he can amortize his startup across multiple jobs.


The big question is, why does it need to run in 10s? The main reason I can see is to be able to run this analysis very frequently, but then your workload is approaching constant.

The total amount of data is 150 GB; that would easily fit into memory on a single powerful 2-socket server with 20 cores and would then run in less than 15 minutes. The hardware required to do that will cost you ~ $6000 from Dell; assuming a system lifetime of five years and assuming (like you do) that you can amortize across multiple jobs, the cost is roughly the same as from the cloud, about $0.036 per analysis.

I'm fairly certain that, in the end, it's not more expensive for the customer to just buy a server to run the analysis on.

Edit: I see OP says 80% of the time is spent reading data into memory, at about 100 MB/s. Add $500 worth of SSD to the example server I outlined, and we can cut the application runtime by >70%, making the dedicated hardware significantly cheaper.
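For what it's worth, here's one way to reconstruct that per-analysis figure under the parent's stated assumptions; it lands around $0.034, in the same ballpark as the $0.036 quoted (the small gap is presumably rounding or ancillary costs). My numbers, not the commenter's exact ones.

    # Reconstruction of the per-analysis cost estimate above, assuming a
    # $6000 server, a five-year lifetime, and back-to-back 15-minute runs.
    server_cost = 6000.0                     # USD
    lifetime_hours = 5 * 365 * 24            # 5 years
    analyses_per_hour = 4                    # 15 minutes each
    print(server_cost / (lifetime_hours * analyses_per_hour))  # ~0.034 USD per analysis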


A vCore is a hyperthread of an unknown CPU, so in reality 1000 vCores is 500 real cores - minus all the overheads, it's more like 450. Given the low utilization until the dataset loads, to keep it at 10 sec you would need about 90 real cores, or 4 x 3-node dual-socket boxes (eBay, ~$1.5K each) and 2 x InfiniBand switches (eBay, 2 x $300). For about $6,600 you have a dedicated solution with no latency bubbles and a fixed low cost.


Briefly... We have many data sets, and the <10 sec calculations happen every few seconds for every data set in active use. Caching results is rarely helpful in our case because the number of possible results is immense. The back end drives an interactive/real-time experience for the user, so we need the speed. Our loads are somewhat spiky; overnight in US time zones we're very quiet, and during the daytime we can use more than 1k vCPUs.

We've considered a few kinds of platforms (AWS spot fleet/GCE autoscaled preemptible VMs, AWS Lambda, bare metal hosting, even Beowulf clusters), and while bare metal has its benefits as you've pointed out, at our current stage it doesn't make sense for us financially.

I omitted from the blog post that we don't rely exclusively on object storage services because its performance is relatively low. We cache files on compute nodes so we avoid that "80% of time is spent reading data" a lot of the time.

(Re: Netflix, in qaq's other comment, I don't have a hard number for this, but I thought a typical AWS data center is only under 20-30% load at any given time.)


They have a single client running a single 10-sec job in a day? They plan to continue having a single client running a single 10-sec job in a day? The workload does have to be spiky to make the cloud cost-effective. There are workloads which are not appropriate for AWS. For any serious client, AWS is a bad idea simply because there is a single tenant (Netflix) consuming such a high percentage of resources that if they make a mistake causing a 40-50% increase in their load, everyone gets f#$%ed.


You're hypothesising about something that has never happened. Check out some 3rd party cloud uptime metrics - the major providers (AWS, Google, Azure) have had less than an hour of downtime in the past year. Reliability is no longer on the agenda - it has been proven.


It did happen, and my clients were affected. After AWS's fu$%ed-up rollout of a software update in 2011 that overwhelmed their control plane, had whole zones down for many hours, and took many days to fully restore, they rolled out patches that throttle cross-zone migration. After those patches, at one point Netflix was having issues and started a massive migration that hit the throttle thresholds and affected the ability of other tenants to move to non-affected zones. It's very far from hypothetical: given that Netflix consumes about 30% of resources (which translates to many full-size physical datacenters), if they spike 50% they will overwhelm the spare capacity.


I spun up something like 200 "cores" to archive a large Cassandra cluster to Google Storage (Kubernetes cluster plus 200+ containers running the archive worker). Could have gone much bigger to get it done faster, but it wasn't necessary. ETL or archive jobs would be the most common case, to answer your question.



