Such a shallow dive: there really needs to be a lot more ink spilled on this topic in great depth. I've worked extensively with AWS over the last 4 years, and I can barely wrap my head around the scope of managing security in AWS. We have an entire department dedicated to security in our company, and none of them are remotely close to being experts in AWS security either.
I'm starting to get curious if there even is an expert who could set up and maintain a bulletproof AWS Account. From the dev/admin accounts to API Gateway to Lambda to RDS and S3; there's just too much to be an expert on. And it's all handled differently (not to mention how many times it's changed in my mere 4 years of experience).
Instances get role data from the metadata service, but containers can't access that metadata and should access the local ECS agent instead (which has its own API). Lambdas must assume a role with a dedicated policy to even write logs, but setting up a scheduled lambda adds an entirely different permission object (with its own policy) just to allow the cloudwatch alert mechanism to even trigger a lambda. DB access can be authenticated using roles, but you have to manually set up the users (and their DB permissions), and it doesn't work with every DB type. S3 buckets can get policies from both built-in bucket policies and from users/roles, with inline or managed policies available for each. The API Gateway's authentication requires a Lambda function, but only passes through a single token and expects an IAM entity in response...
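To make the scheduled-lambda case concrete, here's roughly what it takes with the AWS CLI (role, function, rule, and account names here are all made up for illustration):

    # The function's execution role needs its own policy just to write logs:
    aws iam put-role-policy \
      --role-name my-lambda-role \
      --policy-name allow-logs \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
          "Resource": "arn:aws:logs:*:*:*"
        }]
      }'

    # And the schedule needs a separate, resource-based permission on the
    # function itself before CloudWatch Events is even allowed to invoke it:
    aws lambda add-permission \
      --function-name my-scheduled-fn \
      --statement-id allow-events \
      --action lambda:InvokeFunction \
      --principal events.amazonaws.com \
      --source-arn arn:aws:events:us-east-1:123456789012:rule/my-schedule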
Even AWS' tutorials and built-in managed policies seem to throw their hands up in despair, throwing out wildcard permissions (like s3:*) left and right just to make things work.
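For contrast, here's a rough sketch of the difference (role and bucket names invented): the wildcard grant the tutorials reach for, versus a scoped-down version of the same thing:

    # The tutorial-style wildcard grant - anything, on any bucket:
    aws iam put-role-policy --role-name app-role --policy-name s3-access \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]
      }'

    # A scoped-down equivalent that only permits reads/writes to one bucket:
    aws iam put-role-policy --role-name app-role --policy-name s3-access \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject"],
          "Resource": "arn:aws:s3:::my-app-bucket/*"
        }]
      }'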
No wonder people focus on their network and basic root account security - as frustrating and challenging as network security is, it is still a much more tractable problem.
> Instances get role data from the metadata service, but containers can't access that metadata and should access the local ECS agent instead (which has its own API).
Just a quick aside, but is this can't or shouldn't? I'm 100% positive you can use something like instance profile credentials from within a container (which loads credentials from the instance metadata service).
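A quick sketch of what I mean (assuming default Docker bridge networking and nothing blocking the route):

    # From inside a container on an EC2 instance, the instance metadata
    # service is normally reachable; this lists the instance profile's role:
    curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
    # ...and this returns its temporary credentials (role name from above):
    curl http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>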
I think I agree that there's definitely a lot of depth to topics that should be covered here, and whether you want to go down the rabbit hole will vary based on org size and features you're using.
I'd personally prefer:
1. Deep-dives into best practices for each feature, as opposed to an on-the-surface glance.
2. Enablement with examples: include CloudFormation or Terraform scripts to set up each piece so that we actually build something. Documentation is important, but you can't learn without doing.
3. Testing against the security you've put in place.
Technically, shouldn't. But in AWS' documentation for container roles, they have a note that explicitly suggests implementing an iptables rule (and even provides the iptables command) to prevent access to the instance's metadata.
That said, this is another of those "more ink should be spilled" moments, since preventing access to the instance metadata is something that you SHOULD do from a security point of view.
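If memory serves, the rule the ECS docs suggest is along these lines; it drops container-originated traffic to the metadata address:

    # Block containers (docker+ interfaces) from reaching instance metadata:
    sudo iptables --insert FORWARD 1 --in-interface docker+ \
      --destination 169.254.169.254/32 --jump DROP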
I don't recall Task Roles being a thing when I started using EC2 Container Service. For container security and isolation, that makes a whole lot of sense.
Uh, am I the only one that found none of that confusing, or any more complicated than any other tech stack? Why on earth would I expect the same security process for a relational database and a storage web server (S3)? How is s3:* even bad? I mean, clearly it would be a stupid default, so it's not one - but it's a popular way amongst users to use buckets, for better or worse. You don't have to "throw out wildcards randomly" unless your devops situation is in shambles. Just like on any stack by any provider. Do you expect different policy expressions with any web server on earth?
It lets you do anything, including creating, removing, reading, and updating both objects and buckets. You can also change permissions on the buckets using bucket policies.
> Why on earth would I expect the same security process
Because they do kind of use the same security process - IAM permissions. And not just DBs and S3 - all of AWS uses IAM in some way or another. But it's all just different enough to make it damned hard to create any kind of standard.
Some permissions can be fine-tuned, some cannot. Some can rely on conditionals, some cannot. Some can be overridden at the object level, some cannot. Some use roles, some users, some policies.
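To pick one example of the unevenness (role, bucket, and CIDR here are invented): S3 honors conditional statements like the one below, while plenty of other services and actions expose no such knob at all:

    # An S3 grant restricted by source IP - a conditional some services
    # support and others simply don't:
    aws iam put-role-policy --role-name app-role --policy-name s3-office-only \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::my-app-bucket/*",
          "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
        }]
      }'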
Understanding the differences - what works when - is what makes it all so hard to understand. And when it changes, all that knowledge is useless (or worse, dangerously incomplete) again.
This is expected. Each AWS service should be able to grow independently of the others (new features, integrations, etc.). It would be too difficult to standardize across services; instead, implement standards at a service level, and maybe some more generic ones (non-AWS-specific) outside of that.
Services should be able to grow independently. A consistent API doesn't have to prevent that. Look at Azure and GCE, which don't have the same IAM rat's nest.
> It lets you do anything, including creating, removing, reading, and updating both objects and buckets. You can also change permissions on the buckets using bucket policies.
So if you don't want that, don't write it? I don't see how AWS forces anyone to make that infrastructure decision in any specific case. And it's not the default. You have to go out of your way to do that.
> Because they do kind of use the same security process - IAM permissions. And not just DBs and S3 - all of AWS uses IAM in some way or another. But it's all just different enough to make it damned hard to create any kind of standard.
So? The standard is running your own services and using any of the standard networking options. They are handling provisioning and networking, so naturally they need an abstraction for securing the network clients they are connecting for you.
> Some permissions can be fine-tuned, some cannot. Some can rely on conditionals, some cannot. Some can be overridden at the object level, some cannot. Some use roles, some users, some policies.
So? Permissions, conditionals, and asset policies aren't new or complicated, and are well within what you'd deal with in any tech stack in existence. Same goes for roles, policies, and users... Do those abstractions really seem that complicated? And in which stack do they not exist?
> Understanding the differences - what works when - is what makes it all so hard to understand. And when it changes, all that knowledge is useless (or worse, dangerously incomplete) again.
That's fine, but it's arguably no better with any other service provider. You could say those same things about any software service out there - especially if it's intended for consumption by engineers. Documenting APIs isn't a solved problem for anyone.
I wonder at what point the complexity of interacting security systems hits diminishing returns? Not trying to be negative, I'm a fan of how much power AWS gives you. But seeing how many systems have interacting security implications laid out in a graph like that makes me curious how far you can take it before it becomes difficult to reason about. Maybe the systems are sufficiently isolated and well defined that it's not even an issue.
> Not trying to be negative, I'm a fan of how much power AWS gives you.
I am.
I find the AWS API incredibly baroque, with a lot of historical baggage. I suspect much of this complexity is the result of an accumulation of features made by many people across many teams over the years, plus the inertia of customers relying on it, so there is (understandably) no will to change it.
Classic mistake. Except it's not really a mistake, but a conscious decision by whoever was in charge at the time (with the main focus probably being growing the company and not hurting current customers).
How do we fix that though? Standards seem like the only solution but they either don't move fast enough or the early birds (in this case amazon, but another prime example is microsoft) become so entrenched they set the standard themselves.
My own answer up until now has been to work in linux and open standards jobs (now kubernetes) but this requires increasing amounts of effort.
I'd say that the best way to "fix" it is to have new iterations/versions of an entire region that come online with an updated stance on all aspects of environment management: deployment/security/auth etc.
Let new infra come up in the new region, with auth-gateways to allow the new to talk to the old and vice versa...
maybe you put an S3 mirror of data from new-bucket-type to an old-bucket-type for RO data access from within the old region for data created in the new...
old users can make functional requests of the new api - but cannot manipulate anything directly...
Or some such model -- but roll out wholly new regions and sunset the old over time. (A new region can be us-east-3 next to us-east-2 and can sit in the same physical location to allow for in-house data transit on AWS' part, etc.)
I've always found AWS's security system pretty confusing, so I'm a big fan of this primer. There's such a huge amount of stuff that it can be difficult to even begin to know what to look for. From my limited experience, Google Cloud Platform seems to be much easier to set up.
What I'd really love to see is an end-to-end example of a non-trivial, production-ready project, with all its nitty-gritty details. I'd expect that having a sensible baseline you could look to for general guidance would help improve security and reduce risk.
I found trying to manage and reason about AWS access control super confusing (especially across accounts), so I built a lightweight tool to dump and load IAM config to yaml files. https://github.com/99designs/iamy
It has recently started becoming popular quite organically, so I might just write a blog post on it soon.
I'd say the biggest advantage is that it slots easily into an existing environment that is not necessarily managed strictly.
I've found that, depending on how strict your change management policies are, IAM creds can collect cruft over time as people push new policies in ad hoc. So iamy is handy for such a situation:
- iamy can sync in both directions - pull and push IAM config. So you can easily pull down the ad-hoc changes
- In order to use CFN you need to have access, so there is a chicken-and-egg scenario if you want to manage ALL users in config
- iamy gives you a nice execution plan of aws cli commands, whereas CFN can be opaque
And iamy does ignore any resource managed by CFN, so it works well as a complementary tool.
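For anyone curious, a minimal session looks something like this (going from the README; your setup may differ):

    # Dump the account's IAM users/groups/roles/policies to yaml files:
    iamy pull

    # ...edit the yaml locally, then review and apply the changes:
    iamy push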
The UI is confusing, hard to read, and unapproachable for non-technical people. You guys should really hire a UI guru to re-design and re-word the whole system, and create easy-to-start, plain-English tutorials.
Ok, good start, but pretend that I am dense and give me a specific example along the lines of "The W console makes it hard to do X because it says Y. Why not say Z instead?" Vector values for W, X, Y, and Z are acceptable.
Hey Jeff, long-term AWS fan and daily user of it here. One thing I see time and time again for new user confusion: the names given to AWS services.
It has always felt like AWS has been too playful with the naming of services, to the point of obfuscation. Sure, you and I know what EC2 and S3 are, and what an instance and a bucket are. But for new companies adopting AWS, I swear that a third of my time is spent translating AWS service names into industry terminology for them, and I often hear statements like:
- "Why don't they just call it a virtual machine or cloud storage?"
- "What the heck is an EBS or a Cognito?"
- etc, etc
Also, the first run of the AWS console can be overwhelming when compared to that of Digital Ocean (though I know the two aren't really comparable in terms of breadth of services offered, but look how obvious DO's call to action is).
They also still count against your usage limit, so workflow gets interrupted and you have to wait till they are actually deleted, but you can't really be sure when that happens.
So have a coffee, check back, nope not gone.
Get another one, nope not gone.
Wow, I think the opposite. Have you tried using Azure (yes, the current iteration of their UI)? It's a nightmare. The Amazon UI is fast and efficient; please don't add UI bloat.
It's been a couple months so my cache has mostly flushed, but:
    Properties
      InternetGatewayId
        The ID of the Internet gateway.
This isn't the only example of entirely useless documentation.
Documentation for VPC users and VPC-less users is munged together. VPC users don't care about VPC-less arguments and documentation, and never will (since new customers must use VPCs). They should be completely separate documentation sets.
AMI IDs are critical, but you have to dig into examples to find an up-to-date list; they're spread across multiple examples in multiple locations of the docs, with no directions for finding them.
Default values for properties and configurations don't appear to be documented anywhere. There are warnings about deleting the default VPC but no mention of how to remake it. Is the default VPC magic?
Object IDs must be [A-Za-z0-9]. Why? Neither JSON nor YAML has syntactic issues requiring this.
The documentation talks about Redis (cluster mode disabled) and Redis (cluster mode enabled). Redis (cluster mode disabled) is referred to as a cluster. But cluster mode is disabled? And on some pages the documentation uses "shards", on others "node groups", sometimes both - apparently these terms refer to the same things.
"Scaling Redis (cluster mode disabled) Clusters" only discusses single-node clusters. Multi-node (cluster mode disabled) clusters are discussed in "Scaling Redis Clusters with Replica Nodes". "Scaling Up Single-Node Redis Clusters" shows as "Scaling Up Redis Clusters" in the sidebar.
"Adding nodes to a cluster currently applies only if you are running Memcached or Redis (cluster mode disabled)." But adding nodes applies to all clusters; what AWS doesn't support is adding nodes to an existing cluster in a partitioned Redis setup. This is elucidated in the unlinked page "Scaling Redis Clusters with Replica Nodes", in a completely different section.
Articles frequently contain irrelevant asides, which makes following the documentation harrowing. Like "Because Redis (cluster mode disabled) does not support partitioning your data across multiple clusters, each cluster in a Redis (cluster mode disabled) replication group contains the entire cache dataset." in the middle of "Scaling Redis Clusters with Replica Nodes".
Excellent! Pleased to see VPC Flow Logs included; they are underrated as a security tool and one big advantage AWS has over other providers.
At work I co-develop an open source Python library for reading VPC Flow Logs - it can be an easy way to get started analyzing them for security:
https://github.com/obsrvbl/flowlogs-reader
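If I remember the README correctly, getting started is roughly this (the package also ships a small CLI; "flowlog_group" is an assumed CloudWatch Logs group name):

    # Install the library and dump recent flow records from a log group:
    pip install flowlogs_reader
    flowlogs_reader flowlog_group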
This is a great advertisement for Google Cloud. Google Cloud's security model is much simpler. For the 95% case, you can have different projects for each app and environment, and assign access permissions per project. Then you can gradually add in shared resource access when you need it. To get the same kind of isolation by default with AWS, you need to have multiple AWS accounts (AFAIK), which is a giant pain.
I find AWS security "primers" are usually quite brief - and necessarily so, because otherwise they wouldn't be a "primer". But that brevity leads to a common set of rules which, in the absence of a full explanation, lose something.
For example, in just about every AWS environment I look at, someone knew that they should create an IAM account and never use the root account. Which is why there's a root account that's never used, and one IAM account with "Administrator" permission that everyone shares.
If I ever propose we review it, someone will point me at an AWS security guide and say "it's fine, we're not using the root".
This is the problem with "knowledge by web article" in general, isn't it? Blindly following best practices, rather than developing any true understanding of the task at hand.
Great write-up. I knew there was a lot, but visualizing it really puts it into perspective, especially for the more niche services like Cognito & IoT.
My current job has about 14 different AWS accounts; a few are prod, some are lab, and others are meta accounts. I've been thinking about having a dedicated account just for security-related stuff. I see the value in collecting CloudTrail, Config, and other data there, but I'm not 100% sure it's worth the effort to get it set up right now. Thoughts?
CloudTrail and Config buckets should be in a separate account with no access at all (besides root); otherwise an attacker can delete the CloudTrail logs and you have no idea what they did.
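A rough sketch of the bucket policy for that setup (bucket name and account ID are placeholders): CloudTrail in the monitored account can write logs into the security account's bucket, but nothing else in the monitored account can touch them:

    # Run in the dedicated security account, which owns the bucket:
    aws s3api put-bucket-policy --bucket my-org-trail-bucket --policy '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AWSCloudTrailAclCheck",
          "Effect": "Allow",
          "Principal": {"Service": "cloudtrail.amazonaws.com"},
          "Action": "s3:GetBucketAcl",
          "Resource": "arn:aws:s3:::my-org-trail-bucket"
        },
        {
          "Sid": "AWSCloudTrailWrite",
          "Effect": "Allow",
          "Principal": {"Service": "cloudtrail.amazonaws.com"},
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::my-org-trail-bucket/AWSLogs/111111111111/*",
          "Condition": {"StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}}
        }
      ]
    }'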