A more plausible explanation is that the Xen networking path is simply expensive, the Intel VFs are limited by queue count and silicon (i40e isn't a great ASIC), and the Annapurna part is really an ARM64 NPU. NPUs have been abandoned by most silicon vendors and have a tragic history: it's simply hard to make them work at attractive price/power/performance and at high speed compared with fixed-function scatter/gather I/O units coupled with general-purpose CPUs running software network stacks. The only benefit Annapurna gives EC2 over a software device model is a hard security boundary, effectively another computer inside the computer, which is what makes Nitro metal-as-a-service possible. I think this is one reason why EC2 is limited to 25G while 100G has been commodity for a long time.
Here is a demonstration of a software stack that can scale toward hardware limits without relying on a particular vendor: https://www.slideshare.net/SeanChittenden/freebsd-vpc-introd.... It approaches 100G line rate for large packets, which is what it was optimized for. I don't know the PPS at small packet sizes, but I do know what would be required to optimize that use case, and it could be done pretty quickly.
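For a sense of why "100G line rate for large packets" and "PPS at low packet size" are such different problems, here's a quick back-of-the-envelope sketch (my numbers, not from the slides): it computes the theoretical packet rate at 100G for a few frame sizes, counting the 20 bytes of per-frame Ethernet overhead (preamble, SFD, inter-frame gap).

    # Theoretical packets/sec at line rate, counting 20 B of per-frame
    # Ethernet overhead (7 B preamble + 1 B SFD + 12 B inter-frame gap).
    LINE_RATE_BPS = 100e9        # 100 Gbit/s
    PER_FRAME_OVERHEAD = 20      # bytes on the wire outside the frame itself

    def line_rate_pps(frame_bytes: int, line_rate_bps: float = LINE_RATE_BPS) -> float:
        """Maximum packets/sec a link can carry at the given frame size."""
        return line_rate_bps / ((frame_bytes + PER_FRAME_OVERHEAD) * 8)

    for size in (64, 512, 1500, 9000):
        print(f"{size:>5} B frames: {line_rate_pps(size) / 1e6:8.2f} Mpps")

64 B frames at 100G work out to roughly 148.8 Mpps, which is why small-packet PPS is the hard case compared to large-packet throughput.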
I think this is one reason why EC2 is limited to 25G while 100G has been commodity for a long time.
Interestingly, the ENA driver has #defines for speeds up to 400 Gbps.
My guess as to why EC2 instances are limited to 25 Gbps is that it's a matter of balancing overprovisioning and the need to avoid having a single instance eat too much of a rack's bandwidth. I don't know how much bandwidth they have going to each rack, but there's a limit to how much it makes sense to provision; if typical bandwidth is on the order of 10 Gbps per rack (say, 80 instances pushing 125 Mbps on average) then you might want to provision 200 Gbps/rack and limit each instance to 25 Gbps rather than provisioning 1 Tbps/rack and limiting each instance to 100 Gbps.
(Numbers above are completely invented; I don't have any internal knowledge of how Amazon's networks or datacenters are set up.)
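Using those invented numbers, the trade-off can at least be written down explicitly. This is a toy sketch of the oversubscription arithmetic only; it says nothing about how AWS actually provisions racks.

    # Toy oversubscription math using the (completely invented) numbers above.
    instances_per_rack = 80
    typical_mbps_per_instance = 125   # average, not peak
    typical_rack_demand_gbps = instances_per_rack * typical_mbps_per_instance / 1000

    print(f"typical demand: {typical_rack_demand_gbps:.0f} Gbps/rack")
    for uplink_gbps, cap_gbps in [(200, 25), (1000, 100)]:
        headroom = uplink_gbps / typical_rack_demand_gbps
        hogs = uplink_gbps / cap_gbps   # instances at full cap needed to saturate the uplink
        print(f"{uplink_gbps:>4}G uplink, {cap_gbps:>3}G cap: "
              f"{headroom:.0f}x typical demand, saturated by {hogs:.0f} busy instances")

With these numbers, the 200G/25G combination already gives 20x headroom over typical demand and takes eight instances running flat out to saturate; going to 1 Tbps and 100G caps buys relatively little for the extra provisioning cost.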
Most large-operator datacenters are converging toward things like Clos and fat-tree networks that provide abundant bandwidth at acceptable cost and with minimal blocking. Switch silicon vendors have really done yeoman's work pushing the envelope to make this possible and inexpensive. AWS might have such a large machine count and such generally low customer resource utilization that they can oversubscribe a lot, but it would be pretty silly to only bring 200 Gbps into a rack post-2014, when the Broadcom Tomahawk switch ASIC became dominant.
I'm slightly confused, as you're talking about both AWS Nitro and Xen. I know Nitro moved off of Xen and was roughly based on KVM.
Also, are you talking about Annapurna in its pre-acquisition form or the new one? AWS talks about new custom ASICs and multiple ARM SoCs in their Nitro system.
Agreed - I read this and saw XPS being the culprit writ large.
AWS aren't alone in this, and actually do pretty darn well compared to their competition. We had a nightmarish time a few years back with exactly this at a VPS provider: for half of every second, traffic to the memcached cluster would just stop. It turned out they'd set hard limits on packets/sec to avoid oversaturating the host, so the advertised Gbps interconnect was actually 50 Mbps once you saturated the packet scheduler.
It's really about flows as well, not necessarily total throughput.
AWS Nitro allows 5 Gbit/s per flow and maxes out at 25 Gbit/s per instance. I know GCP does something similar.
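If those numbers are right, one practical consequence is that a single TCP connection can never get near the instance cap; you have to spread traffic across multiple 5-tuples. A trivial arithmetic sketch; the 5 and 25 Gbit/s figures are from the comment above, not from AWS documentation.

    import math

    PER_FLOW_GBPS = 5        # per-flow cap as stated above (unverified)
    PER_INSTANCE_GBPS = 25   # instance cap as stated above (unverified)

    def flows_needed(target_gbps: float) -> int:
        """Minimum number of parallel flows to reach a target rate when
        every single 5-tuple flow is capped."""
        capped = min(target_gbps, PER_INSTANCE_GBPS)
        return math.ceil(capped / PER_FLOW_GBPS)

    print(flows_needed(25))   # -> 5: one connection alone can't fill the instance
    print(flows_needed(8))    # -> 2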
Also, I'm pretty sure that is false regarding Azure. They have limited availability of InfiniBand, but that is not on their general compute platform and has a narrow use case with many restrictions. Azure has had the worst networking performance in my experience and only had 10GbE NICs (it's been a while, though).
Sounds like a marketing blurb, but this is from just a few days ago:
"Azure is breaking the speed barrier in cloud connectivity. ExpressRoute Direct provides 100G connectivity for customers with extreme bandwidth needs. This is 10x faster than other clouds."
It's not so much the size of the packets as it is having flows that can be vectored through the packet-processing stack in batches. That's obviously easier to ensure as a sender or receiver than as something like a bump-in-the-wire deep packet inspector, unless the latter keeps no stateful data.
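To make "vectored through the stack in batches" concrete, here's a toy sketch (nothing AWS-specific; Packet, FlowState, and process_burst are made-up names): incoming packets are grouped by 5-tuple so per-flow state is touched once per batch rather than once per packet.

    from collections import defaultdict
    from typing import NamedTuple

    class Packet(NamedTuple):
        src: str
        dst: str
        sport: int
        dport: int
        proto: str
        payload: bytes

    class FlowState:
        """Minimal per-flow state: just counts packets and bytes."""
        def __init__(self) -> None:
            self.packets = 0
            self.bytes = 0

        def handle(self, pkt: Packet) -> None:
            self.packets += 1
            self.bytes += len(pkt.payload)

    flow_table: dict[tuple, FlowState] = {}

    def process_burst(packets: list[Packet]) -> None:
        """Group a burst by 5-tuple, then process each flow's packets together,
        so flow-state lookups amortize across the batch."""
        batches: dict[tuple, list[Packet]] = defaultdict(list)
        for pkt in packets:
            batches[(pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)].append(pkt)
        for key, batch in batches.items():
            state = flow_table.setdefault(key, FlowState())   # one lookup per flow
            for pkt in batch:
                state.handle(pkt)                             # cheap per-packet work

A middlebox that sees interleaved packets from thousands of flows, and has to keep state for all of them, doesn't get to amortize nearly as much, which is the sender/receiver vs. bump-in-the-wire distinction above.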
Amazon also limits DNS queries, probably in a well-meaning attempt to prevent DNS amplification attacks from originating within AWS. And I mean DNS queries across their network, whether or not they hit Amazon's DNS servers: this is _any_ port 53 UDP traffic.
This issue can easily get amplified if you're using Kubernetes on AWS with a library that doesn't cache DNS on its own. Imagine a health check every 3 seconds that does a bunch of DNS lookups for its dependent services, and a single server running 10 pods.
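One mitigation, besides running a node-local DNS cache in the cluster, is to cache lookups in the application itself. A minimal sketch, assuming you control the code doing the lookups; cached_getaddrinfo is an illustrative name, not a standard API, and a fixed TTL is a simplification.

    import socket
    import time

    # Tiny TTL cache in front of getaddrinfo, so a health check that runs every
    # few seconds doesn't generate a fresh port-53 query each time.
    _CACHE: dict[tuple, tuple[float, list]] = {}
    _TTL_SECONDS = 30.0   # illustrative; ideally honor the real record TTL

    def cached_getaddrinfo(host: str, port: int) -> list:
        key = (host, port)
        now = time.monotonic()
        hit = _CACHE.get(key)
        if hit is not None and now - hit[0] < _TTL_SECONDS:
            return hit[1]
        result = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        _CACHE[key] = (now, result)
        return result

    # Repeated health checks reuse the cached answer instead of hitting port 53.
    for _ in range(3):
        addrs = cached_getaddrinfo("example.com", 443)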
Note that if your traffic hits EC2's connection tracking via security groups, you will also hit per-instance limits on the number of tracked connections [1]. As far as I know, they don't come out and say they have a limit on the number of tracked connections, but they do, and it scales by instance type -- better to adjust your rules so the traffic is allowed in a stateless manner.
I don't know, but wouldn't be surprised if connection tracked packets are more limited than packets that aren't tracked.
That sure sounds like it's being processed by the standard Linux firewall. In which case, yeah, if you have (my favorite example) a web crawler operating on the general web, you'll hit serious limits.
There is a limit if you have a Security Group attached with a rule that is -not- 0.0.0.0/0. So for anything that is public / heavily utilized, the recommendation is to open the service up to 0.0.0.0/0.
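For reference, a wide-open rule of the kind described above looks like this with boto3; the group ID and port are placeholders, and whether this actually sidesteps the tracked-connection limits is the claim above, not something I can confirm.

    import boto3

    ec2 = boto3.client("ec2")

    # Open the service port to 0.0.0.0/0, per the recommendation above for
    # public, heavily utilized services. Group ID and port are placeholders.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public service"}],
        }],
    )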
The (undocumented) PPS limitation on EC2 instance types before they added SR-IOV NICs is around 150K PPS. If you had your own full machine (usually the top size of a given instance class, but no guarantee), this would be pretty consistent. But it was a shared resource, which made running memcache clusters really painful on EC2, given that they'd easily get limited by packet throughput before CPU or bandwidth. With modern instance types it's much better!
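If you want to see where an instance sits relative to limits like this, packet rates are easy to sample from the kernel's interface counters. A quick Linux-only sketch; the interface name is a placeholder, and this only measures what the guest sees, which may differ from what the hypervisor counts.

    import time

    def read_packet_counters(iface: str) -> tuple[int, int]:
        """Return (rx_packets, tx_packets) for an interface from /proc/net/dev."""
        with open("/proc/net/dev") as f:
            for line in f:
                name, _, data = line.partition(":")
                if name.strip() == iface:
                    fields = data.split()
                    return int(fields[1]), int(fields[9])   # rx pkts, tx pkts
        raise ValueError(f"interface {iface!r} not found")

    def sample_pps(iface: str = "eth0", interval: float = 1.0) -> tuple[float, float]:
        """Sample receive/transmit packets-per-second over one interval."""
        rx0, tx0 = read_packet_counters(iface)
        time.sleep(interval)
        rx1, tx1 = read_packet_counters(iface)
        return (rx1 - rx0) / interval, (tx1 - tx0) / interval

    if __name__ == "__main__":
        rx_pps, tx_pps = sample_pps("eth0")
        print(f"rx: {rx_pps:,.0f} pps   tx: {tx_pps:,.0f} pps")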
I took him saying "just wrong" as meaning "I'm not OK with this", but I could be incorrect.
However, with that said, I have always found that calling your rep and asking about specific undocumented limits is the fastest way to get to the bottom of per-instance/account/VPC/whatever limits.
Just as there are limits that can be changed if you agree, in writing, that you will be financially responsible for whatever the impact is (e.g., you could tell them you wanted spot price limits adjusted so you could bid above the scaling factors that were in place; not sure if this is still the case).
Some limits are global and can't be changed/negotiated, but other undocumented limits....
I guess technically it is, but I hesitate to call the need for PPS limitations in the DC "oversubscribing".
Connect a single server to a network and it's oversubscribed. That's a bit hyperbolic, but even some beefy networks can be seriously burdened by just a single server spamming UDP packets without some sort of QoS, especially if it's bypassing user space and using the kernel to just replicate a bunch of packets onto the wire :)
I'm probably a bit biased from having spent time setting up Linux tc on Xen hypervisors for this very reason; I think we even settled on 50k PPS for the per-VM limit too.
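For what it's worth, here is roughly what a cap like that works out to in byte terms. A toy sketch only: classic tc shaping is byte-rate based, so a PPS cap has to be approximated via an assumed average packet size; the 50k figure is from the comment above, everything else is made up.

    # Translate a per-VM PPS cap into an approximate byte rate for a shaper
    # like tc's tbf, which limits bits per second rather than packets per second.
    PPS_CAP = 50_000          # the per-VM figure mentioned above
    AVG_PACKET_BYTES = 512    # pure assumption; pick from your own traffic mix

    def equivalent_rate_mbit(pps: int, avg_pkt_bytes: int) -> float:
        return pps * avg_pkt_bytes * 8 / 1e6

    rate = equivalent_rate_mbit(PPS_CAP, AVG_PACKET_BYTES)
    print(f"{PPS_CAP} pps at ~{AVG_PACKET_BYTES} B/pkt is roughly {rate:.0f} Mbit/s")
    # e.g. something like: tc qdisc add dev vifX.0 root tbf rate 205mbit burst 64kb latency 50ms
    # (vifX.0 is a placeholder Xen guest interface, not the actual setup described above)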
It has to do with hardware support for virtualized networking. Instances capable of SR-IOV or with the ENA can sustain millions of PPS. It makes a huge difference.
Not correct. Instances are rate limited based on their type and protocol. Enhanced networking support increases total potential performance, but there are artificial limitations put in place.
They can sustain millions of PPS, but they still get artificially limited: no matter how many ENIs you attach, a given instance seems to top out at around the same PPS.
Eventually people will realize how AWS overcharges for what they deliver. But of course there's nothing wrong with pricing yourself higher than the lowest cost option...
This is perhaps already being seen in a piecemeal fashion as people compare, e.g., S3 storage prices with other companies' prices.
EC2 throttles everything by default and PPS is no exception.
What can you do when your system needs more bandwidth, CPU, RAM, or any other kind of resource?
You can either scale vertically, which is not bad at the beginning of a project, but sooner or later you will hit a hard limit.
Or
You can scale horizontally, which means having enough nodes or instances to get around those per-instance limits and make sure your project grows well over time.
Netflix runs on EC2, and they probably generate billions of PPS out of Amazon. It's certainly several million PPS, and they don't seem to hit the limits mentioned in the article.
What is it you think Netflix runs on AWS? Content distribution is served from their Open Connect CDN, not AWS. Last I understood, most of Netflix's cloud workloads were analytical/DWH and services: not generally billions of PPS, and certainly not to single instances.