Google Compute Engine makes VPC networking so much easier
than AWS. They automatically create solid defaults: a /20 subnet per region (4,096 addresses), all of which can communicate with each other privately. No need to set up VPNs for cross-region private communication or deal with NAT instances or internet gateways. Private traffic is always private. If you want to expose a VM, just attach an IP address (static or ephemeral).
Also, a cool trick: GCE instance names automatically get mapped in your VPC's internal DNS. If you create an instance named web1, all other instances can ping web1 out of the box with no additional configuration.
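For example, from another instance in the same network, plain hostname resolution just works (a minimal sketch; it assumes you're running on a GCE instance and that an instance named web1 actually exists):

    import socket

    # On a GCE instance the metadata resolver expands short names like "web1"
    # via the VPC's internal DNS search path, so no FQDN is needed.
    addr = socket.gethostbyname("web1")
    print("web1 resolves to", addr)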
I can't stress enough how much better GCE is than AWS in terms of sane defaults, user experience, performance and cost.
GCE does have the huge benefit of having been built several years after AWS organically grew into a set of services whose defaults only seem obvious in retrospect. From my very limited experience with GCE, it seems to have its own oddities that make managing it at an organizational level weirdly difficult, but I can't argue about the details in a well-informed manner. I will say that while the lack of dead-stupid cross-region VPC peering is annoying in AWS, having talked to some of their engineers about the issue, it seems that their reluctance to make global features dead-simple stems from hard experience of their own in the risks of global interdependencies and the failure conditions that can result.
But to your specific details, I'm curious about the "private traffic stays private" and "no NATs required". Does GCE provide implicit NAT, or does it just work the way AWS public subnets work (i.e., give it an IP and it's on the Internet, don't give it one and it isn't)?
>>> having talked to some of their engineers about the issue, it seems that their reluctance to make global features dead-simple stems from hard experience of their own in the risks of global interdependencies and the failure conditions that can result.
Poor excuses. "It's too hard".
It is very hard to make global services. Google cracked the technology long before everyone else. Most of their cloud services are worldwide. They proved that it's possible.
Having spent much time in AWS, it seems the technology that runs it is exclusively mono-region. All regions are segregated and independent, each essentially a full AWS clone. There are no services that span regions.
AWS is simply not global. IMO that hints that AWS does not have the technology.
1) Mono-region is a feature, not a bug. Regulatory requirements, data citizenship requirements, the ability to not have a single global point of failure, etc. are all huge benefits of the regional breakdown as it stands today. Nothing prevents you from deploying multi-region.
2) IAM is effectively global. Route53 is effectively global. AWS 'has the technology', they've just chosen a different engineering stance than Google, and are very up front about that.
1) Locality requirements should be fulfilled by attaching or limiting workloads to specific regions, not by having the entire infrastructure only ever exist in a single place (which breaks numerous legal requirements, by the way).
2) ELB/ALB are not. S3 is not. AMIs only exist in a single region. EBS volumes cannot be moved, ever. Billing is not really global either.
AWS is heavily region-centric. For instance, it is impossible to list all instances attached to your account in one place.
In the command line tools and the web UI, you pick a region up front, and you will only see instances in that region.
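For example, getting a full inventory with boto3 means looping over every region yourself (a rough sketch; it assumes default credentials and a default region are configured, and pagination is omitted):

    import boto3

    # EC2 APIs are strictly per-region, so a "global" inventory means
    # asking every region separately. (Pagination omitted for brevity.)
    regions = [r["RegionName"]
               for r in boto3.client("ec2").describe_regions()["Regions"]]

    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        for reservation in ec2.describe_instances()["Reservations"]:
            for instance in reservation["Instances"]:
                print(region, instance["InstanceId"], instance["State"]["Name"])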
The real question is: is there enough market demand for global services? I know on paper, yeah, there is, but how many can pay for it?
And even if they're big enough to pay for it, how many are willing to put their core services (if you're offering global services, they're most likely your core services) in the same basket... controlled by a huge company who might be one of your competitors? I guess many of the companies building a service on the global scale would want to have that expertise in-house.
>>> The real question is: is there enough market demand for global services? I know on paper, yeah, there is, but how many can pay for it?
Yes. The world is full of international companies and services that span the globe. They have the money to pay for whatever they want.
The issue, so far, has only been the tech.
Let's say someone builds an ordinary app coupled with a database, say MySQL and PHP. MySQL doesn't work across datacenters, so it's not going worldwide. End of story.
There are very few systems that can span datacenters.
Unless a project was explicitly architected for it, it probably can't, because of its technical choices.
Google is the evolution. All the tech is worldwide. You develop as usual and it comes out of the box.
Ok... but what about lock-in? I use Google's magical pixie dust, great. Then at some point I want to get out. With MySQL I can do that. I can run it on 1 machine, I can cluster it to a reasonably large scale.
Can I run Google's magic machine on my own server(s)?
It's not "lock-in", any more than deciding to use MySQL or any other product is. Please call that a design choice or a partnership.
You want to get out, you download your database data and move back to AWS pixie dust or self-hosted pixie dust.
Most services are pretty standard; it's not magic. There are comparable equivalents. The most advanced capabilities only come with commercial products that cost a million dollars (EMC, VMware, F5, Akamai).
P.S. MySQL got taken over by Oracle. Nothing is forever ;)
It's not as user-friendly, but if you actually understand the complexities of VPC networking, AWS allows a lot more flexibility with routing and (potentially) network-level security than GCE. In general, networking/ops guys love AWS because they can set it up exactly how they want, while developers who just want something working out of the box love GCE.
AWS really requires a management layer like Terraform in order to be maintainable though. Trying to manage all the routing and connections by hand for more than a single VPC is hellish.
Some GCE services are just not up to par yet, either. Their CloudSQL service, for instance, is a pretty horrible mess right now, while RDS is rock solid (I do cloud architecture and consulting, so I have a pretty wide swath of clients on both services to compare with).
GCE allows custom VPC setups, subnets, routing tables, and VPNs. They also have easy-to-set-up cross-project VPC connections, and their firewall rules with tagging are superior to security groups.
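For reference, a tag-based firewall rule looks roughly like this through the Compute API (a hedged sketch using google-api-python-client; the project, network, source range, and tag names here are placeholders I made up for illustration):

    from googleapiclient import discovery

    compute = discovery.build("compute", "v1")

    # Hypothetical rule: allow HTTP/HTTPS to any instance carrying the "web"
    # tag, from inside the VPC only. Project/network/tag names are made up.
    firewall_body = {
        "name": "allow-web-internal",
        "network": "global/networks/default",
        "direction": "INGRESS",
        "sourceRanges": ["10.128.0.0/9"],
        "targetTags": ["web"],
        "allowed": [{"IPProtocol": "tcp", "ports": ["80", "443"]}],
    }

    compute.firewalls().insert(project="my-project", body=firewall_body).execute()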
I also find your comment arrogant and misguided, as I also run a US-based ops and DevOps consulting company:
> but if you actually understand the complexities of VPC networking, AWS allows a lot more flexibility with routing and (potentially) network-level security than GCE
With all the ceremony required to get private-subnet instances talking to the internet, why not just put all instances in a public subnet and cut off incoming connections using a security group?
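For what it's worth, that setup is only a couple of API calls (a boto3 sketch; the VPC ID and CIDR are placeholders). Security groups are deny-by-default for inbound, so nothing gets in until you explicitly allow it:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # A new security group allows no inbound traffic at all until rules are added.
    sg = ec2.create_security_group(
        GroupName="public-but-closed",
        Description="Public subnet instances, inbound locked down",
        VpcId="vpc-0123456789abcdef0",  # placeholder
    )

    # Explicitly allow only SSH from a trusted range (placeholder CIDR).
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
        }],
    )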
Defence in depth? An extra layer between an attacker and your secret data is rarely a bad thing.
The tradeoff against convenience is worth it for some environments. E.g., what if AWS has an issue with applying security groups properly and it fails open?
Arguably, failing open is the right thing to do in some cases as it avoids causing an outage for your users. However it does expose them to a risk they may not have been prepared for.
I know OpenStack had an issue a few years back where, when the component responsible for managing the security group rule implementation was reset, it failed open until all the rules were rebuilt (at the time, this was implemented as iptables rules on the hypervisors). Another bug (well, poor implementation) meant it could take several minutes or more to rebuild the rules.
It happens, and you should be aware of and understand the risks of things like this happening when designing your infrastructure and choosing which tradeoffs are worth it for your particular use case.
The defence in depth concept makes sense. The good news is that if I understand correctly, security groups fail closed - so that makes things a little safer.
Just spoke to an AWS architect, and the points made were similar: more secure default state, less chance of screwing up. Ham-fisted attempts at security groups can open things up too much, private subnets are tough to expose unintentionally, adding multiple security groups makes the rules less restrictive (possibly in unintended ways), etc.
Amazon doesn't know how many resources you're going to spin up and how many ENIs they will need, so it's hard to tell you how big a subnet to create.
You can also use the AWS NAT Gateway, which doesn't require you to run your own NAT instance.
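Roughly like this with boto3 (a sketch; the subnet, Elastic IP allocation, and route table IDs are placeholders, and in practice you'd wait for the gateway to become available before adding the route):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The NAT Gateway lives in a public subnet and needs an Elastic IP allocation.
    nat = ec2.create_nat_gateway(
        SubnetId="subnet-0aaa1111bbbb2222c",        # public subnet (placeholder)
        AllocationId="eipalloc-0123456789abcdef0",  # Elastic IP (placeholder)
    )["NatGateway"]

    # Point the private route table's default route at the gateway.
    ec2.create_route(
        RouteTableId="rtb-0ddd3333eeee4444f",  # placeholder
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat["NatGatewayId"],
    )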
Annoyingly, the VPC NAT Gateway is more expensive than running your own NAT instance. If your traffic is low enough to be handled by a t2.small or lower, it's cheaper to run your own. Most of my NATs are nanos or micros.
AWS also double-dips on the traffic charges for the VPC NAT - you're charged for the traffic it transfers, but you're also charged for the same traffic in the general bill, from what I've been able to glean. Given that traffic is where AWS is not competitively priced, it's something to be cautious of.
Sure, but you also don't need a big NAT Gateway if you're doing the majority of your external chatter via an ELB. Our setup is basically production load goes through an ELB, and incidental traffic goes through the default route (the NAT), so the NAT really only handles traffic like me ssh'ing in and wanting to install a tool, or config management setting up a new server.
But yes, if you're going to be sending big traffic through the NAT, a t2.small doesn't have the network performance.