A more plausible explanation is that the Xen networking path is simply expensive, the Intel VFs are limited by queue count and silicon (i40e isn't a great ASIC), and the Annapurna part is really an ARM64 NPU. NPUs have been abandoned by most silicon vendors and have a tragic history: it's simply hard to make them work at attractive price/power/performance and at high speed compared with fixed-function scatter/gather I/O units coupled with general-purpose CPUs running software network stacks. The only benefit Annapurna gives EC2 over a software device model is a hard security boundary, effectively another computer inside the computer, which is what makes Nitro metal-as-a-service possible. I think this is one reason why EC2 is limited to 25G while 100G has been commodity for a long time.
Here is a demonstration of a software stack that can scale toward hardware limits without relying on a particular vendor: https://www.slideshare.net/SeanChittenden/freebsd-vpc-introd.... It approaches 100G line rate for large packets, which is what it was optimized for. I don't know the PPS at small packet sizes, but I do know what would be required to optimize that use case, and it could be done pretty quickly.
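For a sense of why "100G line rate for large packets" and "PPS at low packet size" are such different problems, here's a quick back-of-the-envelope sketch (my numbers, not from the slides): it computes the theoretical packet rate at 100G for a few frame sizes, counting the 20 bytes of per-frame Ethernet overhead (preamble, SFD, inter-frame gap).

    # Theoretical packets/sec at line rate, counting 20 B of per-frame
    # Ethernet overhead (7 B preamble + 1 B SFD + 12 B inter-frame gap).
    LINE_RATE_BPS = 100e9        # 100 Gbit/s
    PER_FRAME_OVERHEAD = 20      # bytes on the wire outside the frame itself

    def line_rate_pps(frame_bytes: int, line_rate_bps: float = LINE_RATE_BPS) -> float:
        """Maximum packets/sec a link can carry at the given frame size."""
        return line_rate_bps / ((frame_bytes + PER_FRAME_OVERHEAD) * 8)

    for size in (64, 512, 1500, 9000):
        print(f"{size:>5} B frames: {line_rate_pps(size) / 1e6:8.2f} Mpps")

64 B frames at 100G work out to roughly 148.8 Mpps, which is why small-packet PPS is the hard case compared to large-packet throughput.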
I think this is one reason why EC2 is limited to 25G while 100G has been commodity for a long time.
Interestingly, the ENA driver has #defines for speeds up to 400 Gbps.
My guess as to why EC2 instances are limited to 25 Gbps is that it's a matter of balancing overprovisioning and the need to avoid having a single instance eat too much of a rack's bandwidth. I don't know how much bandwidth they have going to each rack, but there's a limit to how much it makes sense to provision; if typical bandwidth is on the order of 10 Gbps per rack (say, 80 instances pushing 125 Mbps on average) then you might want to provision 200 Gbps/rack and limit each instance to 25 Gbps rather than provisioning 1 Tbps/rack and limiting each instance to 100 Gbps.
(Numbers above are completely invented; I don't have any internal knowledge of how Amazon's networks or datacenters are set up.)
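Using those invented numbers, the trade-off can at least be written down explicitly. This is a toy sketch of the oversubscription arithmetic only; it says nothing about how AWS actually provisions racks.

    # Toy oversubscription math using the (completely invented) numbers above.
    instances_per_rack = 80
    typical_mbps_per_instance = 125   # average, not peak
    typical_rack_demand_gbps = instances_per_rack * typical_mbps_per_instance / 1000

    print(f"typical demand: {typical_rack_demand_gbps:.0f} Gbps/rack")
    for uplink_gbps, cap_gbps in [(200, 25), (1000, 100)]:
        headroom = uplink_gbps / typical_rack_demand_gbps
        hogs = uplink_gbps / cap_gbps   # instances at full cap needed to saturate the uplink
        print(f"{uplink_gbps:>4}G uplink, {cap_gbps:>3}G cap: "
              f"{headroom:.0f}x typical demand, saturated by {hogs:.0f} busy instances")

With these numbers, the 200G/25G combination already gives 20x headroom over typical demand and takes eight instances running flat out to saturate; going to 1 Tbps and 100G caps buys relatively little for the extra provisioning cost.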
Most large-operator datacenters are converging toward things like Clos and fat-tree networks that provide abundant bandwidth at acceptable cost and with minimal blocking. Switch silicon vendors have really done yeoman's work pushing the envelope to make this possible and inexpensive. AWS might have such a large machine count and such generally low customer resource utilization that they can oversubscribe a lot, but it would be pretty silly to only bring 200 Gbps into a rack post-2014, when the Broadcom Tomahawk switch ASIC became dominant.
I'm slightly confused, as you're talking about both AWS Nitro and Xen. I know Nitro moved off of Xen and was roughly based on KVM.
Also, are you talking about Annapurna in its pre-acquisition form or the new one? AWS talks about new custom ASICs and multiple ARM SoCs in their Nitro system.
Agreed - I read this and saw XPS being the culprit writ large.
AWS aren't alone in this, and actually do pretty darn well compared to their competition. We had a nightmarish time a few years back with exactly this at a VPS provider: for half of every second, traffic to the memcached cluster would just stop. It turned out they'd set hard limits on packets/sec to avoid oversaturating the host, so the advertised Gbps interconnect was actually 50 Mbps once you saturated the packet scheduler.
It's really about flows as well, not necessarily total throughput.
AWS Nitro allows 5 Gbit/s per flow and maxes out at 25 Gbit/s per instance. I know GCP does something similar.
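If those numbers are right, one practical consequence is that a single TCP connection can never get near the instance cap; you have to spread traffic across multiple 5-tuples. A trivial arithmetic sketch; the 5 and 25 Gbit/s figures are from the comment above, not from AWS documentation.

    import math

    PER_FLOW_GBPS = 5        # per-flow cap as stated above (unverified)
    PER_INSTANCE_GBPS = 25   # instance cap as stated above (unverified)

    def flows_needed(target_gbps: float) -> int:
        """Minimum number of parallel flows to reach a target rate when
        every single 5-tuple flow is capped."""
        capped = min(target_gbps, PER_INSTANCE_GBPS)
        return math.ceil(capped / PER_FLOW_GBPS)

    print(flows_needed(25))   # -> 5: one connection alone can't fill the instance
    print(flows_needed(8))    # -> 2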
Also, I'm pretty sure that is false regarding Azure. They have limited availability of InfiniBand, but that is not on their general compute platform and has a narrow use case with many restrictions. Azure has had the worst networking performance in my experience and only had 10GbE NICs (it's been a while, though).
Sounds like a marketing blurb, but this is from just a few days ago:
"Azure is breaking the speed barrier in cloud connectivity. ExpressRoute Direct provides 100G connectivity for customers with extreme bandwidth needs. This is 10x faster than other clouds."
It's not so much the size of the packets as it is having flows that can be vectored through the packet-processing stack in batches. That's obviously easier to ensure as a sender or receiver than as something like a bump-in-the-wire deep packet inspector, unless the latter keeps no stateful data.
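To make "vectored through the stack in batches" concrete, here's a toy sketch (nothing AWS-specific; Packet, FlowState, and process_burst are made-up names): incoming packets are grouped by 5-tuple so per-flow state is touched once per batch rather than once per packet.

    from collections import defaultdict
    from typing import NamedTuple

    class Packet(NamedTuple):
        src: str
        dst: str
        sport: int
        dport: int
        proto: str
        payload: bytes

    class FlowState:
        """Minimal per-flow state: just counts packets and bytes."""
        def __init__(self) -> None:
            self.packets = 0
            self.bytes = 0

        def handle(self, pkt: Packet) -> None:
            self.packets += 1
            self.bytes += len(pkt.payload)

    flow_table: dict[tuple, FlowState] = {}

    def process_burst(packets: list[Packet]) -> None:
        """Group a burst by 5-tuple, then process each flow's packets together,
        so flow-state lookups amortize across the batch."""
        batches: dict[tuple, list[Packet]] = defaultdict(list)
        for pkt in packets:
            batches[(pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)].append(pkt)
        for key, batch in batches.items():
            state = flow_table.setdefault(key, FlowState())   # one lookup per flow
            for pkt in batch:
                state.handle(pkt)                             # cheap per-packet work

A middlebox that sees interleaved packets from thousands of flows, and has to keep state for all of them, doesn't get to amortize nearly as much, which is the sender/receiver vs. bump-in-the-wire distinction above.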
Amazon also limits DNS queries, probably in a well-meaning attempt to prevent DNS amplification attacks from originating within AWS. And I mean DNS queries across their network, whether or not they hit Amazon's DNS servers: this is _any_ port 53 UDP traffic.
This issue can easily get amplified if you're using Kubernetes on AWS with a library that doesn't cache DNS on its own. Imagine a health check every 3 seconds that does a bunch of DNS lookups for its dependent services, and a single server running 10 pods.
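One mitigation, besides running a node-local DNS cache in the cluster, is to cache lookups in the application itself. A minimal sketch, assuming you control the code doing the lookups; cached_getaddrinfo is an illustrative name, not a standard API, and a fixed TTL is a simplification.

    import socket
    import time

    # Tiny TTL cache in front of getaddrinfo, so a health check that runs every
    # few seconds doesn't generate a fresh port-53 query each time.
    _CACHE: dict[tuple, tuple[float, list]] = {}
    _TTL_SECONDS = 30.0   # illustrative; ideally honor the real record TTL

    def cached_getaddrinfo(host: str, port: int) -> list:
        key = (host, port)
        now = time.monotonic()
        hit = _CACHE.get(key)
        if hit is not None and now - hit[0] < _TTL_SECONDS:
            return hit[1]
        result = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        _CACHE[key] = (now, result)
        return result

    # Repeated health checks reuse the cached answer instead of hitting port 53.
    for _ in range(3):
        addrs = cached_getaddrinfo("example.com", 443)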
Note that if your traffic hits EC2's connection tracking via security groups, you will also hit per-instance limits on the number of tracked connections [1]. As far as I know, they don't come out and say they have a limit on the number of tracked connections, but they do, and it scales by instance type -- better to adjust your rules so the traffic is allowed in a stateless manner.
I don't know, but wouldn't be surprised if connection tracked packets are more limited than packets that aren't tracked.
That sure sounds like it's being processed by the standard Linux firewall. In which case, yeah, if you have (my favorite example) a web crawler operating on the general web, you'll hit serious limits.
There is a limit if you have a Security Group attached with a rule that is -not- 0.0.0.0/0. So for anything that is public / heavily utilized, the recommendation is to open the service up to 0.0.0.0/0.
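For reference, a wide-open rule of the kind described above looks like this with boto3; the group ID and port are placeholders, and whether this actually sidesteps the tracked-connection limits is the claim above, not something I can confirm.

    import boto3

    ec2 = boto3.client("ec2")

    # Open the service port to 0.0.0.0/0, per the recommendation above for
    # public, heavily utilized services. Group ID and port are placeholders.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public service"}],
        }],
    )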
The (undocumented) PPS limitation on EC2 instance types before they added SR-IOV NICs is around 150K PPS. If you had your own full machine (usually the top size of a given instance class, but no guarantee), this would be pretty consistent. But it was a shared resource, which made running memcache clusters really painful on EC2, given that they'd easily get limited by packet throughput before CPU or bandwidth. With modern instance types it's much better!
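If you want to see where an instance sits relative to limits like this, packet rates are easy to sample from the kernel's interface counters. A quick Linux-only sketch; the interface name is a placeholder, and this only measures what the guest sees, which may differ from what the hypervisor counts.

    import time

    def read_packet_counters(iface: str) -> tuple[int, int]:
        """Return (rx_packets, tx_packets) for an interface from /proc/net/dev."""
        with open("/proc/net/dev") as f:
            for line in f:
                name, _, data = line.partition(":")
                if name.strip() == iface:
                    fields = data.split()
                    return int(fields[1]), int(fields[9])   # rx pkts, tx pkts
        raise ValueError(f"interface {iface!r} not found")

    def sample_pps(iface: str = "eth0", interval: float = 1.0) -> tuple[float, float]:
        """Sample receive/transmit packets-per-second over one interval."""
        rx0, tx0 = read_packet_counters(iface)
        time.sleep(interval)
        rx1, tx1 = read_packet_counters(iface)
        return (rx1 - rx0) / interval, (tx1 - tx0) / interval

    if __name__ == "__main__":
        rx_pps, tx_pps = sample_pps("eth0")
        print(f"rx: {rx_pps:,.0f} pps   tx: {tx_pps:,.0f} pps")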
I took him saying "just wrong" as meaning "I'm not OK with this", but I could be incorrect.
However, with that said, I have always found that calling your rep and asking about specific undocumented limits is the fastest way to get to the bottom of per-instance/account/VPC/whatever limits.
Just as there are limits that can be changed if you agree, in writing, that you will be financially responsible for whatever the impact is (e.g., you could tell them you wanted spot price limits adjusted so you could bid above the scaling factors that were in place; not sure if this is still the case).
Some limits are global and can't be changed/negotiated, but other undocumented limits....
I guess technically it is, but I hesitate to call the need for PPS limitations in the DC "oversubscribing".
Connect a single server to a network and it's oversubscribed. That's a bit hyperbolic, but even some beefy networks can be seriously burdened by just a single server spamming UDP packets without some sort of QoS, especially if it's bypassing user space and using the kernel to just replicate a bunch of packets onto the wire :)
I'm probably a bit biased from having spent time setting up Linux tc on Xen hypervisors for this very reason; I think we even settled on 50k PPS for the per-VM limit too.
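For what it's worth, here is roughly what a cap like that works out to in byte terms. A toy sketch only: classic tc shaping is byte-rate based, so a PPS cap has to be approximated via an assumed average packet size; the 50k figure is from the comment above, everything else is made up.

    # Translate a per-VM PPS cap into an approximate byte rate for a shaper
    # like tc's tbf, which limits bits per second rather than packets per second.
    PPS_CAP = 50_000          # the per-VM figure mentioned above
    AVG_PACKET_BYTES = 512    # pure assumption; pick from your own traffic mix

    def equivalent_rate_mbit(pps: int, avg_pkt_bytes: int) -> float:
        return pps * avg_pkt_bytes * 8 / 1e6

    rate = equivalent_rate_mbit(PPS_CAP, AVG_PACKET_BYTES)
    print(f"{PPS_CAP} pps at ~{AVG_PACKET_BYTES} B/pkt is roughly {rate:.0f} Mbit/s")
    # e.g. something like: tc qdisc add dev vifX.0 root tbf rate 205mbit burst 64kb latency 50ms
    # (vifX.0 is a placeholder Xen guest interface, not the actual setup described above)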
It has to do with hardware support for virtualized networking. Instances capable of SR-IOV or with the ENA can sustain millions of PPS. It makes a huge difference.
Not correct. Instances are rate limited based on their type and protocol. Enhanced networking support increases total potential performance, but there are artificial limitations put in place.
They can sustain millions of PPS, but they still get artificially limited: no matter how many ENIs you attach, a given instance seems to top out at around the same PPS.
Eventually people will realize how AWS overcharges for what they deliver. But of course there's nothing wrong with pricing yourself higher than the lowest cost option...
This is perhaps already being seen in a piecemeal fashion as people compare, e.g., S3 storage prices with other companies' prices.
EC2 throttles everything by default and PPS is no exception.
What can you do when your system needs more bandwidth, CPU, RAM, or any other kind of resource?
You can either scale vertically, which is not bad at the beginning of a project, but sooner or later you will hit a hard limit.
Or
You can scale horizontally, which means having enough nodes or instances to get around those per-instance limits and make sure your project grows well over time.
Netflix runs on EC2, and they probably generate billions of PPS out of Amazon. It's certainly several million PPS, and they don't seem to hit the limits mentioned in the article.
What is it you think Netflix runs on AWS? Content distribution is served from their Open Connect CDN, not AWS. Last I understood, most of Netflix's cloud workloads were analytical/DWH and services: not generally billions of PPS, and certainly not to single instances.