AWS Nuke – delete all resources associated with AWS account (github.com/rebuy-de)
222 points by fortran77 on July 2, 2022 | 119 comments



Shout out to AWS Batch, where if you delete the role assigned to a compute cluster, the cluster itself becomes impossible to delete.

Found this out after using aws-nuke.
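For reference, the ordering that avoids the wedge looks roughly like this with boto3. This is only a sketch: the compute environment and role names are hypothetical, and the waiting and error handling are simplified.

    import time
    import boto3

    batch = boto3.client("batch")
    iam = boto3.client("iam")

    CE_NAME = "my-compute-env"            # hypothetical
    ROLE_NAME = "my-batch-service-role"   # hypothetical

    # 1. Disable the compute environment so it can be deleted.
    batch.update_compute_environment(computeEnvironment=CE_NAME, state="DISABLED")

    # 2. Wait until Batch reports it DISABLED and VALID, then delete it.
    while True:
        envs = batch.describe_compute_environments(
            computeEnvironments=[CE_NAME])["computeEnvironments"]
        if not envs or (envs[0]["state"] == "DISABLED" and envs[0]["status"] == "VALID"):
            break
        time.sleep(15)
    batch.delete_compute_environment(computeEnvironment=CE_NAME)

    # 3. Only after the compute environment is gone is it safe to delete the
    #    role (assuming its policies have already been detached).
    iam.delete_role(RoleName=ROLE_NAME)

Delete the role first and, as described above, the compute environment's deletion never completes.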


This is because neither AWS nor Azure enforces referential integrity in their "cloud scale" databases. For example, Azure uses some hideous JavaScript-based document DB where things like renames, moves, and deletes are hit and miss at best. A never-ending whack-a-mole of bugs and issues.

Remember boys and girls: being "cloud scale" means data corruption and referential integrity violations!


Not only no referential integrity, but also no support for Read-Your-Writes[1]. Cloud scale! Nothing like:

1. Create resource. Success.

2. Attempt to use/reference first resource in another resource/call: Failure: Referenced resource does not exist. Odd.

3. Create resource again? Failure! Resource already exists.

Our scripts have so many retry loops and arbitrary pauses mixed in to account for garbage like this. Distinguishing "did the call fail?" from "is the system just lost in the land of eventual inconsistency?" is exhausting.
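A minimal sketch of the kind of retry wrapper this ends up requiring (the probe callable, attempt count, and backoff numbers are all arbitrary, not anything Azure-specific):

    import time

    class NotVisibleYet(Exception):
        # Raised by a probe when a freshly created resource is not yet readable.
        pass

    def wait_until_visible(probe, attempts=10, delay=2.0):
        # After a *successful* create, keep probing until the resource is
        # readable, treating "not found" as eventual consistency rather than
        # as a real failure. `probe` is any callable that raises NotVisibleYet
        # while the write has not propagated, and returns the resource once it has.
        for attempt in range(attempts):
            try:
                return probe()
            except NotVisibleYet:
                time.sleep(delay * (attempt + 1))  # back off a little each time
        raise TimeoutError("resource never became visible; giving up")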

And yeah, shout out to Azure AD, where I can have a role assignment that is granting permissions to "unknown". We call them ghosts.

[1]: https://jepsen.io/consistency/models/read-your-writes


> Our scripts have so many retry loops and arbitrary pauses mixed in

This is unfortunately your fault, not Azure's. Their API is explicitly designed around this eventual consistency and weak references, so your client side must take this into account. Typical bash, Python, or PowerShell scripts are the Wrong approach with a capital W, and you will forever be tearing your hair out if you persist in using them. (Or any similar imperative deployment mechanism.)

The only robust method is ARM Templates, or better yet, Bicep templates[1]. The latter simply compile down to ARM, so they're essentially equivalent but terser and with nicer tab-complete.

Compared to scripts, templates have key advantages:

1. Built-in incremental / differential deploy capability. A partially deployed template can be simply redeployed[2] to "fix it up", without requiring client-side logic for every corner case.

2. Can deploy multiple changes that would fail if deployed step-by-step. For example, App Gateway can have intermediate configurations that won't validate on the way to a valid final configuration. This is madness to unravel with scripts. Templates generally just take you to the final configuration in one step.

3. Inherently parallel. Anything that can be deployed concurrently will be. Anything. No need to write complex and error-prone parallel loops on the client side!

4. Largely immune to temporary failures like missing reads after writes.[3] The template engine has a built-in retry loop for most (all?) fallible steps. You'll see it has "failed"... and then "succeeded" anyway.

[1] https://docs.microsoft.com/en-us/azure/azure-resource-manage...

[2] Most of the time. All resources should be idempotent to redeployment, but many aren't because this is not mechanically enforced. IMHO, this is just shoddy, shoddy engineering and everyone involved should be ashamed. Being nearly idempotent is like being nearly pregnant.

[3] You still need a few tricks up your sleeve for robust deployments. Anything outside of ARM, such as Azure AD groups and RBAC, tends to be a PITA. Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feed that into the template. That works, because the GUID can be used even if the full object is not fully replicated around yet.
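A sketch of that wrapper pattern, assuming the az CLI; the group name, resource group, template file, and parameter name are all hypothetical, and newer az versions expose the group's object GUID as `id`:

    import subprocess

    def az(*args):
        # Run an az CLI command and return its trimmed stdout.
        return subprocess.run(
            ["az", *args], check=True, capture_output=True, text=True
        ).stdout.strip()

    # 1. Create the AAD group outside of ARM and capture its object GUID.
    group_id = az("ad", "group", "create",
                  "--display-name", "my-app-operators",
                  "--mail-nickname", "my-app-operators",
                  "--query", "id", "-o", "tsv")

    # 2. Feed the GUID into the template deployment; the GUID is usable even
    #    before the group object has fully replicated.
    az("deployment", "group", "create",
       "--resource-group", "rg-my-app",
       "--template-file", "main.bicep",
       "--parameters", f"operatorsGroupObjectId={group_id}")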


Yeah, I'm hitting this again today, so now I can comment more knowingly.

> Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feed that into the template. That works, because the GUID can be used even if the full object is not fully replicated around yet.

In the case I have today (wanting to perform an action on a newly created application), this trick doesn't work. (We make the call by specifying the application by ID, too.)

"Bad Request […] It looks like the application '[ID]' you are trying to use has been removed or is configured to use an incorrect application identifier."

It's not removed, of course, and the ID isn't incorrect.

Edit: actually, it is worse than that. So, there's the above error, and that's essentially a failure to have read-your-writes.

Our scripts retry on that, because our scripts have become accustomed to AAD's shit. But we eventually hit this sequence of events:

  1. Create the app
  2. Grant admin-consent
  [other necessary setup]
  3. Create an AKS cluster: <this fails>
And it fails because the app needs to have admin-consent granted on it. But we do that, in step 2, and my logging is now good enough to show that we not only retry it after a read-your-writes failure, but that the command eventually succeeds, yet the UI doesn't end up reflecting that. This is not a read-your-writes failure, this is a lost write!


To be frank, that's a lot of words to say that, instead of fixing the bugs at their core, MS wrote an entire product to try to work around the bugs in the original product, and now wants me to use that layer instead. And, yeah, that's about how MS sees the world. But "yeah, no" is the serious answer there, and we're moving anything we can to Terraform first.

Most of the failures we see are with AAD, which, AIUI, ARM templates do nothing for.

Even within ARM, as I understand them, ARM templates cannot handle deletes or changes. (They are deployments of new resources.)

And even if one could use templates, that just abstracts the same problem: how long do you wait for the template to finish, and how do you know whether it actually has? (We see changes in ARM take >30 minutes to take effect, and even after completion it can take more minutes for things to "settle", i.e., for successive requests to reliably return the same result.) It just bottles it all into one highly inconsistent box, maybe, and requires me to learn an entire language on the side.

> That works, because the GUID can be used even if the full object is not fully replicated around yet.

Interesting. I'll keep that in mind.


Not even Redshift enforces uniqueness on its primary keys or referential integrity on its foreign keys. That was a fun finding out…


> That was a fun finding out…

Not my idea of a good time, but whatever floats your boat. Hopefully you had a backup?


That was almost certainly sarcasm.


I don’t think it is - the issue is that the Batch service needs to assume your role in order to clean up associated cluster resources like auto scaling groups.

If the role is deleted, it can’t do this.


That's what referential integrity violation means: an essential related piece of information can be deleted without the parent/using object also being deleted.


Sure, in an abstract sense maybe, but in a cross-service context it’s got nothing to do with their “cloud scale databases”.

In fact, being able to delete a role without removing any associated resources is a feature, not a bug. And how would you even ensure referential integrity in this case - you would achieve the same effect by modifying the assume role policy but keeping the role around.

You could craft a policy that only allows Batch to assume the role on a Tuesday for example.


Wedging a cloud resource so you can't delete it is always a bug.


Maybe. But the GP above has a point. What does a role have to do with this particular resource? You can't block role deletion just because it's associated with a resource; that would be a functional nightmare. You also can't just cascade-delete all resources associated with a role. The only real thing that can be done is assign the resource to some superuser with all the rights so they can delete it instead.


> You can't block role deletion just because it's associated with a resource; that would be a functional nightmare.

Asserting this doesn't actually make your argument for you.

Why would this be a problem?


It doesn’t make any sense. Forgive me for saying this, but if you don’t really know how all this works then it’s a convenient throwaway thing to suggest.

So I think it’s up to you to explain how this would work when there is no difference between a role being deleted and a role being inaccessible.

What exactly would you do if I block the instance with the ID "xyz" from assuming the role I have assigned to it? How would you detect that I've done this in every single case?


That's an inline policy which is attached to the role. And notably, AWS requires all policies to be removed from a role before it can be deleted (which I know because I spent a bunch of time recently fixing ordering issues with CloudFormation deletes).

Again, you keep just asserting this isn't possible: why? AWS is aware you need the role to exist to delete the instance (as noted up in the OP), so why is it apparently completely inconceivable that the IAM system would check for this condition before executing an action which, again, irreversibly wedges a delete operation for a resource?

Unless you have some deep knowledge of how AWS IAM is implemented which makes this literally impossible, then you're asserting fluff.
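(For what it's worth, the ordering constraint mentioned above, policies before role, looks roughly like this in boto3; the role name is hypothetical and instance-profile removal is omitted.)

    import boto3

    iam = boto3.client("iam")
    role_name = "my-batch-service-role"  # hypothetical

    # Managed policies must be detached first...
    for policy in iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]:
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])

    # ...and inline policies deleted...
    for policy_name in iam.list_role_policies(RoleName=role_name)["PolicyNames"]:
        iam.delete_role_policy(RoleName=role_name, PolicyName=policy_name)

    # ...before DeleteRole will succeed.
    iam.delete_role(RoleName=role_name)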


It's possible, it's just convoluted: it introduces a weird circular dependency, isn't always the behavior you want in the general case, could be achieved without changes to the IAM service via CloudFormation or Terraform anyway, would require IAM to support every single possible "delete" variant (delete an instance with a backup? Without one? Final RDS snapshot name? Etc.), would need to recursively delete resources, would require a complex set of APIs to track the asynchronous deletion process, and, as stated several times before, is completely ineffective against conditions where the role hasn't been deleted but the service cannot assume it. Which leaves you in exactly the state you're trying to avoid.

In short: it would be a confusing mess for nebulous gains that doesn’t pass any kind of smell test. Instead they should just… fix the AWS batch service.

The better solution is to provide an API and console tab to show you what services last used the role, when they used it and how they used it. Which is what they do.
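That "last used" view is backed by the IAM service-last-accessed API; a sketch with boto3 (the role ARN is hypothetical, polling simplified):

    import time
    import boto3

    iam = boto3.client("iam")
    role_arn = "arn:aws:iam::123456789012:role/my-batch-service-role"  # hypothetical

    # Kick off report generation, then poll for the result.
    job_id = iam.generate_service_last_accessed_details(Arn=role_arn)["JobId"]
    while True:
        details = iam.get_service_last_accessed_details(JobId=job_id)
        if details["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(2)

    for svc in details["ServicesLastAccessed"]:
        print(svc["ServiceName"], svc.get("LastAuthenticated", "never accessed"))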


Yes, in the service that provides the AWS resource.

Because it didn’t handle the fact that the role it’s using might be deleted or otherwise rendered un-assumable for a variety of different reasons at any point in time.

Which is a feature. Not a bug.


I'm curious how you see this as a feature when it can get you into a very expensive and unresolvable situation; an AWS resource can't be deleted and is running up costs. You're at the mercy of AWS support.


It can also be a security vulnerability if the resource that cannot be deleted is compromised and can access or contain critical data for example.


Can it? The only case I know of involving roles specifically is Batch, and the resources it's trying (and failing) to clean up are ones with absolutely no cost.

It’s a feature because there are plenty of cases, such as a role being compromised, where you don’t want to cascade-delete every single associated resource without any need.

If you want this then you can opt into it by using cloudformation.


It's not a feature, but how do you solve this? You can't block role deletion because it has a resource associated with it; that would be a functional nightmare.


A very classic way of doing this would be to prevent, by default, deleting a role that has resources tightly associated with it. If someone really wants to delete the role anyway, you can provide a feature to do that and cascade-delete the associated resources.


It’s like you didn’t even read what I wrote. That way sucks, especially if you have hundreds of allocated resources. Cascade delete is not a realistic option here either. Imagine someone leaves the company without notice. All their resources need to be deleted?


When someone leaves a company you look at their resources and assign new people to them or delete them. But you should prefer teams and projects over individuals anyway.


"Cascading deletes"


There's loads of stuff like this.

If you detach an EC2 instance from a virtual tape library and destroy it you can't delete the tape library any more. Even AWS support couldn't delete it. This is fine until you have 60TB of tapes online and are paying for it.

Fortunately we found an EBS snapshot of the VM.


This is my worst nightmare. I had a situation like this where I was able to essentially ignore the entire AWS account and cancel the attached credit card. Amazon somehow linked this AWS account to my retail Amazon.com account and started charging my personal credit card.


Even better: if you have your CloudFormation compute environment and IAM role in the same stack and something goes wrong, the rollback can happen in such a way that the IAM role gets deleted first, stranding the compute environment and failing the rollback.


Update the stack to use a different role with CloudFormation in its trust policy; then you can delete the stack with the new role. It has to be two operations to work.
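One way that recovery can look with boto3 (a sketch; the role and stack names are hypothetical, and DeleteStack happens to accept the role ARN directly, which keeps it to the two operations described):

    import json
    import boto3

    iam = boto3.client("iam")
    cfn = boto3.client("cloudformation")

    # Operation 1: create a role that CloudFormation itself can assume.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "cloudformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    role_arn = iam.create_role(
        RoleName="cfn-stack-recovery",  # hypothetical
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )["Role"]["Arn"]
    iam.attach_role_policy(
        RoleName="cfn-stack-recovery",
        PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",  # scope down in practice
    )

    # Operation 2: delete the stuck stack using the new role.
    cfn.delete_stack(StackName="my-broken-batch-stack", RoleARN=role_arn)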


That's a weird one, but surely an aws-nuke bug? It must already use a deliberate order - there's plenty of resources that need anything linked/constituent deleted first - so that order is/was just not correct for those?


No, Batch's interaction with IAM roles and permission is super weird and not at all documented.

It is easy to screw it up.


I've never used AWS Batch but heard only good things about it. If you could elaborate - i.e. warn me - I would very much appreciate it. What should I know before using AWS Batch?


First is the problem as described. If you delete the ComputeEnvironment role, you can't delete the compute environment itself.

If you make a mistake setting up the Batch ComputeEnvironment using CloudFormation and it rolls back, the CloudFormation error message is useless ("resource failed to stabilise") and no trace is left behind for you to check.

Two, Batch needs to use service-linked roles. If you create an AWS thing that needs a service-linked role via the console, it gets created for you automatically. If you create it via the CLI or CloudFormation, it does not. So if someone has created a Batch environment in your account before, it will probably be there, but if you are setting up in a new account you might have no clue why things aren't working.

Compute environments and job specs that use Fargate are configured differently from EC2 Batch environments, requiring an extra IAM role.
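On the service-linked-role point, it can be created explicitly when you are not going through the console; a sketch (verify the service name against the Batch docs for your setup):

    import boto3
    from botocore.exceptions import ClientError

    iam = boto3.client("iam")
    try:
        # Creates AWSServiceRoleForBatch if it does not already exist.
        iam.create_service_linked_role(AWSServiceName="batch.amazonaws.com")
    except ClientError as err:
        # Typically already present in accounts where Batch was used via the console.
        if err.response["Error"]["Code"] != "InvalidInput":
            raise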


Thank you! You've probably saved me much head scratching down the line.


What I mean is 'yes that's a weird interaction' but, if that's how the AWS service works (however weird) then it's an aws-nuke bug that the ordering isn't correct (to account for that weirdness)?

Though 'not at all documented' makes that harder of course. I haven't used it.


Aws batch is a terrible product


AWS is a terrible product, unless you happen to be in specific corners they've bothered to document/make their UI architecture not a horror show.


Nah I've seen this too. It is considered an expected behavior. In a way it makes sense. But it's not well documented.


But that's what I mean, if that's how it works, nuke needs to account for it. I assume it already does for other dependencies, because there are a lot of them - including that do make more (obvious) sense.

You're not the only one to complain about the documentation though, so it's easy to see how aws-nuke apparently missed it!


Not sure why but I laughed out loud reading this.

Edit: I've run into this, but it's considered a feature, not a bug!


The idea that we can actually delete anything from a non-physical system whose hypervisor we do not control is absolute nonsense. Unless you have control over the physical system and the hypervisor, all you can do is destroy your ability to access the information. We can never have confidence in how it truly works on the back end.


Try logging into the console as root, that lets you delete anything.

You can also try updating the cluster to have a new role.


That's… not the issue. The cluster gets stuck in a deleting state forever, at which point you're unable to update it.


Time to open a support ticket. I've had a few things get into a bad state where they could not be updated or deleted; the backend team can step in, but first-line support needs to see the usual solutions fail before they escalate to them.


Sounds like it’s a bug in AWS nuke as well then.


99% of our AWS resources are terraformed, but developers constantly push back / want to use the console to create stuff to test or play around with. So we set up a separate "hack" AWS account, give them admin access in there, and have an automated job using AWS Nuke to delete everything in there once a quarter.


We give each one of our developers their very own AWS account managed through the AWS Organizations service. They are full administrators and responsible for resources and cost.

So far we haven't had any issues or bad surprises, although we have set up some AWS billing alerts just in case.

Feel free to make them responsible for cost and resources and you’ll be surprised how well they can manage their own account.


> Feel free to make them responsible for cost and resources and you’ll be surprised how well they can manage their own account.

Wow, this is horrible. I understand responsibility, but this is too much. Are other employees responsible if the company loses money because of their actions?


It is quite common to have budgets that employees have to work within.


Ever had a company credit card?


I suspect it's more that the individuals would be warned and re-trained if they didn't keep their costs under control (usually it's done at the team level) rather than having actual financial responsibility.


Not sure what the problem is; if anybody exceeds the expected « normal usage », we simply get in touch and fix the issue.

Lessons learned for everybody, it’s a win-win situation.


I think it really depends upon a number of factors, but even pretty smart people can make stupid mistakes, especially when it comes to security in AWS. I'm familiar with several cases where engineers fired up old AMIs and got the instances compromised within an hour because they were running old, vulnerable software in a publicly routable subnet.

There are some basic rules to follow that can help avoid issues like those, though as organizations scale they eventually need to be enforced to a greater degree. Disallowing provisioning their own VPCs, disallowing publicly routed subnets, and establishing some decent auth infrastructure is all a good start that will work for a long time and have minimal friction for users. I'm a strong believer in security as a UX problem where doing the Right Thing should be easier than doing the Lazy / Bad Thing, so if people are having issues doing things the right way, I've messed up and need to improve usability and meet my users where they are to achieve my own goals of a secured infrastructure.

Giving people responsibility and autonomy also comes with some responsibilities on the provider's side in a shared-responsibility model, is all I'm saying, and every policy works out fine until it doesn't.


> We give each one of our developers their very own AWS account managed through the AWS Organizations service. They are full administrators and responsible for resources and cost.

How many developers work at your organization?


25 at the moment :)


> although we have set up some AWS billing alerts just in case.

My experience with these has been decidedly mixed. As in, you define them and never, ever see an alert.


Hmmm, weird.

We always get the alerts in time, with thresholds set to 70% of the desired value.
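For reference, a minimal sketch of a classic billing alarm at such a threshold (the EstimatedCharges metric only lives in us-east-1 and requires billing alerts to be enabled; the SNS topic ARN and dollar amount are hypothetical):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="billing-70-percent-of-budget",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,              # 6 hours; billing metrics update a few times a day
        EvaluationPeriods=1,
        Threshold=70.0,            # e.g. 70% of a $100 budget
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )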


I'm more worried about someone inexperienced with AWS accidentally doing something really expensive than any kind of intentional abuse.


If you make a mistake with excessive resource allocation, you can get in touch with AWS and ask for a refund, and they will gladly give one.

I've had to do it a couple of times for personal and professional accounts, and I've never had any rejections from them.


We have done something similar: a sandbox account that developers and solution designers can play around in, experiment, and manually create resources in, as long as resources are properly tagged (we have devised internal naming conventions). They are also responsible for the clean-up; a resource that is not properly tagged is purged automatically after an 8-hour time lapse.

Other accounts and environments (including dev) require everyone to follow a streamlined process: read-only access to the account, a fully documented solution design, and a corresponding terraform project in GitHub. A terraform project check-in triggers a pull request for review and approval. Once the pull request has been scrutinised and merged, CI/CD runs terraform to provision resources in the account.
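An EC2-only sketch of what such a purge job can look like (the required tag key is hypothetical, pagination is omitted, and a real job would cover many more resource types):

    from datetime import datetime, timedelta, timezone
    import boto3

    REQUIRED_TAG = "owner"  # hypothetical tag key
    CUTOFF = datetime.now(timezone.utc) - timedelta(hours=8)

    ec2 = boto3.client("ec2")
    to_terminate = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if REQUIRED_TAG not in tags and instance["LaunchTime"] < CUTOFF:
                to_terminate.append(instance["InstanceId"])

    if to_terminate:
        ec2.terminate_instances(InstanceIds=to_terminate)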


I would love a `terraform <plan|apply> --nuke`.

Not `destroy` - the opposite - destroy/nuke what I don't have in my config.


That's one of my big complaints with terraform, and its state-based system. It doesn't inform you if there are resources you don't know about.


It is a feature and a good one (with a caveat though).

terraform is best used on a per-solution basis: one solution, one dedicated terraform project that will manage its state and its own state only. Multiple projects and people carry out work in the same cloud account in parallel, and their terraform projects are not meant to interfere with each other. It works best for solutions that make use of fully managed cloud services.

Then there are also platform or connectivity level cloud resources (e.g. Transit Gateway and subnets that are mapped into the internal organisational network address space in AWS) that a random terraform project ought not to manage.

Lastly, if there is an actual need, a resource that terraform does not know about can be manually imported into the terraform project's state. This works best when infrastructure-level resources were manually created a while ago and now have to be refactored into and managed by a terraform project. It is a tedious process that has to proceed with a lot of caution.

The caveat: it gets somewhat tricky when a non-serverless cloud resource requires an explicit subnet range allocation within an existing and managed CIDR or similar. There is no one solution to fit all, but containing such projects to their own dedicated VPC and setting up VPC peering between the solution-specific VPC and the main account VPC usually works satisfactorily. That is, for example, how Kafka (AWS MSK) can be introduced into an AWS account without affecting the existing CIDR mapping.


Each of your projects could use a distinct terraform workspace; the hypothetical `terraform <plan|apply> --nuke` would already need to look across all workspaces, considering all state. Or/also it could look at multiple remote states.

> Then there are also platform or connectivity level cloud resources (e.g. Transit Gateway and subnets that are mapped into the internal organisational network address space in AWS) that a random terraform project ought not to manage.

Not a 'random' one sure, personally I'd still want it somewhere. I suppose the hypothetical command might want an optional whitelist of non-tf-managed stuff to ignore though. (But then, you could whitelist it just by writing the terraform and importing it?)

> Lastly, if there is an actual need, a resource that terraform does not know about can be manually imported

The hard/annoying part that I'd like this command for is discovering these resources. i.e. it's not just that terraform does not know about them, it's that I probably don't. Or at least I don't realise they're not captured in terraform.

A very easy one to overlook is security group rules: unless you define them inline in terraform (i.e. ingress/egress blocks on a security group resource) then adding additional rules outside of terraform does not cause a diff. So you might be testing them out by manually poking around, and then you forget to terraform them/remove them, and they're left there forever with terraform blissfully unaware, and if you ever happen to notice it might not be obvious whether they're needed or not.

Essentially, it'd be useful for enforcing that terraform's used for everything; maintaining 'IaaC' hygiene.


Oh that would be glorious. Not sure how it'd be possible though.


I've hitherto assumed it didn't because it wasn't, short of calling every single `*:Describe*` API anyway.

But the existence of aws-nuke makes me think (I haven't looked into what it's doing yet) there must be a better way of discovering used services/resources. Through billing perhaps?
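Billing does give you at least a service-level inventory; a sketch with the Cost Explorer API (the date range is hypothetical):

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    # Any service that shows up with non-zero spend is one you have resources in.
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2022-06-01", "End": "2022-07-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 0:
            print(f"{service}: ${amount:.2f}")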


We've built an (open source) asset inventory product just for that use case, to continuously discover all inventory, not just what's provisioned by Terraform (or Pulumi, etc.).

Link to the repo in my HN profile.


May be possible using driftctl and some API calls.


I've used aws-nuke a bunch but the use case seems significantly diminished now that AWS Organizations has the ability to delete entire accounts.


Last time I checked, it was a lengthy process involving attaching a credit card, leaving the organization, and then deleting the account. Has it been changed?


I'm in the process of doing this now. You can close accounts from Control Tower without needing to log in as root to each separate account, adding a credit card, removing it from the org, and then closing it manually.

However, you can only close them from Control Tower at the rate of 2 to 3 per month, due to a hard limit quota which cannot be changed, even if you request it. Needless to say, this sucks when you've followed AWS's own best practices and created lots of accounts using Control Tower's "vending machine."

AWS's archaic account model is one reason we've switched to GCP.


It seems to me this cannot possibly be a hard limit. If it’s a hard limit it’s only because AWS wants to milk you dry.


To be exact, the hard limit is: you cannot delete more than 10% of your organization's accounts (capped at 200) via AWS Organizations within a 30-day rolling window. You can always delete an account by going into it as the root user.

https://docs.aws.amazon.com/organizations/latest/userguide/o...


I suspect it’s a hard limit to prevent disgruntled (former) admin blast radius.


If that was the concern then surely they could enable support to restore accounts during some grace period. The machinery for that already exists for people who completely close their AWS accounts down.

Maybe it was quicker to implement with a hard limit, or there is some internal service that can't easily handle large volumes of account removals. But if it was Amazon losing money from this limitation instead of the customer I imagine it would be fixed pretty quickly.


Given that the potential disgruntled admin already has access to nukes, this seems useless.




I would think they should probably change the name so it doesn't come across as an official AWS product, but it seems nice.


https://github.com/genevieve/leftovers

A co-worker made this when we worked together on a project that ran a large number of terraform configurations in CI against real IaaSes. Each account was a sandbox, so we would run this at the end of a pipeline to clean up any failures or faulty teardowns.


This fork is 36 commits ahead. It aims to stabilize deletion (so far for AWS) and add extended regex filtering support.

https://github.com/notrepo05/leftovers


Unlike aws-nuke, there seem to be actual automated tests here for this tool. That's really neat!

Though I do feel the project reaches out to quite a lot of IaaSes for its current level of maintenance, which seems to be roughly zero.

aws-nuke gets contributions here and there, and maybe that's because it's quite focused.

That's a real shame; the implementations for each provider look to be great, standalone by themselves.


Wow! This is actually some great work!


I noticed that this GCP article about Terraform best practices linked to a similar tool: cloud-nuke (https://github.com/gruntwork-io/cloud-nuke)

https://cloud.google.com/docs/terraform/best-practices-for-t...

> After you run the terraform destroy command, also run additional clean-up procedures to remove any resources that Terraform failed to destroy. Do this by deleting any projects used for test execution or by using a tool like cloud-nuke.

> Warning: Don't use such tools in a production environment.

It seems like this tool only supports AWS though, at least nowadays.

I kind of feel bad for the GCP people here. Damned if you do (link to an external project, which might change), damned if you don't.

Besides this thing, I really enjoyed reading this article though! AWS is missing this kind of content.


In GCP, the tool is not necessary, because you can simply delete the containing project to "nuke" any resources attached to it.


There is a 30 day "marked for deletion period" after you do this. The resources related to the project aren't immediately nuked.

https://cloud.google.com/resource-manager/docs/creating-mana...


Deleting a project does immediately turn off (stop billing) the vast majority of associated resources, including VMs. The main exception is storage, but that cost is typically fairly marginal.


Oh, ok. That seems fair and makes sense. (I wish that had been documented.)


Are Azure resource groups still pretty comprehensive? You used to be able to create all your stuff in a resource group. Delete the group and it deletes all the stuff with it. Super convenient.


Yeah they seem to be. Except Azure AD stuff, which kinda makes sense (still wish that part were more easily automated)
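For the ARM side, deleting the group (and everything in it) really is a one-liner; a sketch via the az CLI (the resource group name is hypothetical):

    import subprocess

    # Deletes the resource group and every ARM resource inside it.
    subprocess.run(
        ["az", "group", "delete", "--name", "rg-sandbox", "--yes", "--no-wait"],
        check=True,
    )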


It astounds me that AWS doesn't have a way to do this built in. Or even just a way to list all the resources you have, across all services in an account, without having to make thousands of API calls and hope you didn't miss a resource type.
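The closest thing to a single "list everything" call is the Resource Groups Tagging API, though it only covers services integrated with it; a sketch:

    import boto3

    tagging = boto3.client("resourcegroupstaggingapi")

    # Pages through resource ARNs across services in the current region.
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate():
        for resource in page["ResourceTagMappingList"]:
            print(resource["ResourceARN"])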


Shout out to my Azure homies enjoying their Resource Groups.


I’ve contributed to and used this tool extensively. I have accounts where I run this thing out of a CodeBuild job on a cron schedule to dejunk after a day of terraform development. Fantastic tool.


Funny enough, on the completely opposite end of the spectrum, I was once surprised that after you close an account (post M&A in this case), you can typically restore the resources for ~90 days if you decide you want it back and don't nuke before you request the closure. Can be useful or scary depending on the contents of the account...


Yeah, it's actually a selling point in case a disgruntled employee or hacker gets credentials and asks for the account to be shut down.


Namespaces could solve all of that. Why hasn't AWS added that yet? I understand that they rarely change the service, but c'mon, we're in 2022.


How does that help them make more money?


this will shave 20% off of Amazon's earnings!


This is a fantastic utility to have in my back pocket, and the attention to safety is commendable. nice!


Needs a flag to run non-interactively


I nearly took this bait


`aws-nuke --no-dry-run --non-interactive --i-really-mean-it --yes-i-agree-to-the-second-prompt-also`


So I have this weird monthly charge, and I cannot for the love of god find out what is causing it. I cancelled my credit card and now my credit score has taken a hit.

All because of some unknown AWS service in some zone that I cannot find at all, and neither can AWS support.


I love how everyone wants to help you/figure out the mystery bug, because, after all, you just need knowledge and experience to maybe avoid getting screwed around by AWS.


That seems like the sort of thing chargebacks are intended to solve. Your credit shouldn't be taking a hit for that.


What does the account statement say?


Is the charge the same every month?


This is also a useful tool just to find resources hiding within an account. It defaults to performing a dry run (it’s a PITA to actually delete stuff, on purpose).


Hello. I am one of the maintainers. Sorry that we have not spent much time on aws-nuke lately.

Feel free to ask any questions.


I sure wish cloud providers had a "fake" implementation of their APIs that I could use in tests, for cases like the one that this program was written for. There are so many cross-dependencies that can be checked ("sorry can't create that VM, you need a service account first" or "VM names can only be 69 characters and can't contain '!'") without creating actual resources. You can test all the edge cases that resolve to failure in a unit test, and then only need one integration test to get that end-to-end "when I run 800 Terraform files, can I talk to my webserver at test239847abcf123.domain-name.com?".

I've found that the alternative to this is either a CI run that takes hours, or simply not testing the edge cases and hoping for the best. The former is a nightmare for velocity, the latter is a nightmare for having more than one person working on the team ever.

Lately, I've been fortunate to not be working in the "general cloud provider" space (where everything is a black box that can change at any moment, and documentation is an afterthought of afterthoughts) and have only focused on things going on in Kubernetes. To facilitate fast tests, I forked Kubernetes, made some internal testing infrastructure public, and implemented a kubelet-alike that runs "containers" in the same process as the test.

For the application I work on at work, we are basically a data-driven job management system that runs on K8s. People have already written the easy tests; build the code into a container, create a Kubernetes cluster, start the app running, poke at it over the API. These take for-fucking-ever to run. (They do give you good confidence that Linux's sleep system call works well, though! Boy can it sleep.)

With my in-process Kubernetes cluster, most of that time goes away, and you can still test a lot of stuff. If you want to test "what happens when a very restrictive AppArmor policy is applied to all pods in the cluster", yeah, the "fake" doesn't work. If you want to test "when my worker starts up, will it start processing work", it works great. (And, since the 100% legit "api machinery" is up and running, you can still test things like "if my user's spec contains an invalid pod patch, will they get a good error message?") Most bugs (and mistakes that people make while adding new features) are in that second category, and so you end up spending milliseconds instead of minutes testing the parts of your application that are most hurtful to users when they break.

(And, nobody is saying not to run SOME live integration tests. You should always start with, and keep, integration tests against real environments running real workloads. When they pass, you get some confidence that there are no major showstoppers. When they fail, you want the lighter-weight tests to point you with precision to the faulty assumption or bug.)

Anyway... it makes me sad that things like aws-nuke are the kind of tooling you need to produce reliable software focused on deployment on cloud providers. I'd certainly pay $0.01 more per VM hour to be able to delete 99% of my slow tests. But I think I'm the only person in the world that thinks tests should be thorough and fast, so I'm on my own here. Sad.
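For AWS at least, the moto library is one attempt at exactly this kind of in-process fake; a sketch (assumes moto 5.x, where the single entry point is `mock_aws`):

    import boto3
    from moto import mock_aws  # pip install 'moto[s3]'

    @mock_aws
    def test_bucket_roundtrip():
        # Inside the decorator, boto3 calls hit moto's in-memory fake instead
        # of real AWS, so this runs in milliseconds with no real account.
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.create_bucket(Bucket="test-bucket")
        s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hi")
        assert s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read() == b"hi"

    test_bucket_roundtrip()

It only covers the API shapes it has implemented, not every cross-service validation rule, but it is a big step up from sleeping through slow integration runs.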


To me it's obvious that Amazon managers intentionally avoid giving us such a "delete all resources" button, simply to maximize profits.


The fact that this is hard enough to do that it requires a dedicated third-party tool…


I can barely imagine what is in my AWS account. The last time I used it was 2011.


Project sponsored by Azure ?


No, we use it to clean up after our Terraform integration tests. It is not always possible to use `terraform destroy`, because it might happen that the state does not allow it (at least that was our experience a couple of years ago).


I deleted my AWS account because SES refused to take me out of sandbox due to a security violation that could not be disclosed to me. Interesting that almost all forms of SMTP are blocked once you are on some secret black list.


What’s the fallout?



