I made a mistake with Terraform and Azure made it worse (craigstuntz.com)
170 points by todsacerdoti on July 8, 2021 | 110 comments



You can (and always should) decouple the plan and apply steps. For a non-terminal approach this can be done by writing the binary plan to a plan file (with -out=path) and reviewing the plan's terminal output, which shows all actions Terraform will perform in human-readable format. Then use the plan file as input for the apply step. The apply won't perform any action that was not in the plan file (which matches the output that was reviewed), and if the state of the environment it would apply to has changed in the meantime, it will abort without causing undesired changes and you can restart the process.
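A minimal Python sketch of that plan-then-apply flow, assuming `terraform` is on the PATH and is run from the configuration directory. The plan file name `tfplan` and the helper names are made up for illustration; only `terraform plan -out=...` and `terraform apply <planfile>` come from Terraform itself.

```python
import subprocess
from typing import Callable, List

PLAN_FILE = "tfplan"  # hypothetical file name, pick whatever suits your pipeline


def plan_command(plan_file: str = PLAN_FILE) -> List[str]:
    """Build the `terraform plan` call that saves the binary plan to a file."""
    return ["terraform", "plan", f"-out={plan_file}"]


def apply_command(plan_file: str = PLAN_FILE) -> List[str]:
    """Build the `terraform apply` call that applies only the saved plan."""
    return ["terraform", "apply", plan_file]


def plan_then_apply(
    confirm: Callable[[str], str] = input,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Run plan, wait for a human to approve the output, then apply that exact plan.

    Returns True if the plan was applied, False if the reviewer declined.
    """
    # The plan's human-readable output goes to the terminal for review.
    run(plan_command(), check=True)
    answer = confirm("Apply this plan? Type 'yes' to continue: ")
    if answer.strip() != "yes":
        return False
    # Applying a saved plan file does not ask for confirmation again.
    run(apply_command(), check=True)
    return True
```

In a pipeline the `confirm` step would typically be a manual approval gate rather than `input`, with the plan file passed between stages as an artifact.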

There is also the prevent_destroy[0] meta argument for resources but afaik it has no effect when you remove the resource from your .tf files[1], so it would not have helped in this case.

[0] https://www.terraform.io/docs/language/meta-arguments/lifecy...

[1] https://github.com/hashicorp/terraform/issues/17599


This!

In addition to that, things can be made even less error prone. I've done this using a YAML pipeline in Azure DevOps. The plan task can be used to set an output variable which indicates whether the generated plan contains any changes. That boolean value is used as a condition to trigger a manual verification task, which basically prevents apply from running if there are any changes that haven't been reviewed first.

As the OP mentions, the generated plan is an artifact itself that is used in a following apply task.


I'm not very familiar with Azure anymore, but AWS also offers deletion protection on RDS and other services. This forces a two-phase operation with Terraform. You must first apply configuration to remove the deletion protection, then in the second phase you may delete the RDS instance.

Unfortunately, I've never been able to get much mileage from the Terraform prevent_destroy lifecycle option because it can't be set from a variable. Most of my configurations use a module and pass different variable values for each environment. I'd want the lifecycle flag in production, but maybe not dev.


I'd also think about whether you actually don't want this in dev. In the past, it's been helpful to have stage and dev environments be as close to production as possible, quirks with lifecycle and all.


Yes, Azure has a similar thing for resources where you can set a "CanNotDelete" lock on one. You have to delete the lock before you can delete the resource.


> Unfortunately, I've never been able to get much mileage from the Terraform prevent_destroy lifecycle option because it can't be set from a variable

There is a trick around this. Code 2 mostly identical resources, one with prevent_destroy true and the other with prevent_destroy false. Give both resources a count. You can make the count dependent on interpolation. And the count can be zero. So depending on the interpolation result you have resource(s) with prevent_destroy or without. For the "wrong" ones you just create 0 of them.

https://stackoverflow.com/questions/53727357/terraform-how-t...

https://stackoverflow.com/questions/53744441/how-to-refer-to...


That just means that enabling `prevent_destroy` later on will a) destroy the resource b) create a new resource with a different name and `prevent_destroy` enabled.

Plus the combinatorial explosion of all possible conditions.

Not a good trick.


Why would I want to enable it later? When I deploy production resources, they have prevent_destroy. It's the same resource all the time and it must not be destroyed.

When I want to deploy a test system the resource can be destroyed after I am done.

Of course this is not really elegant code. It violates DRY, I need to make sure that my test resources have identical configuration as my production resources, except this one attribute.

Maybe your use case is different. I never change the prevent_destroy attribute during the lifetime of the resource.


That's a "trick" I saw in the wild in several community TF modules but as the other commenter says, it's not that clever because once you want to change that specific parameter, you will trigger a complete destroy+recreate action which as you might imagine is not that good in real production.


This is exactly what we do also. Apart from prevent_destroy, you can also have termination protection enabled on the actual instance. Not sure if Azure offers it, but AWS definitely does.


“I” create dozens of ephemeral terraform deployments for every iaas every day using terraform so I don’t have this luxury. Do you have any suggestions for keeping tabs on automated systems beyond spend anomalies?


What was actually needed here was a Resource Lock on the Resource Group in question.


In my opinion, databases are not cattle, and don't need to be automatically created (and destroyed!) in your main Terraform plan.

It's perfectly OK to have a completely separate Terraform project that just configures the DB initially (or even manually, I see lots of places running DB's that predate Terraform with immutable infrastructure for everything else), and applies minor non-destructive changes in the future. This way you get the benefits of IaC, but the DB plan doesn't participate with the rest of your infrastructure that IS ok to blow away and re-create at will.

BTW, Amazon RDS backups work the exact same way: Destroy the database and the backups are also destroyed. Therefore, same region automated RDS backups are fine for day-to-day, but in a true "DB goes poof" disaster you should expect that you WILL lose them too! You need cross-region, or even better, cross-account DB replication or snapshots to survive this.


> You need cross-region, or even better, cross-account DB replication or snapshots to survive this.

Excellent point.

AWS provide a couple of decent options for cross account backups, the Aurora Snapshot Tool [1] and the AWS Backup service. I've used both successfully.

[1] https://github.com/awslabs/aurora-snapshot-tool


It's worth checking out "terragrunt". It helps one break down larger terraform sites into reusable modules which can be run independently without having to redefine common configuration such as backends.

It allows for running a plan or apply on the whole site, or each individual module in the site.

So in this case it's easy to just apply the database changes, and then run the whole site plan to make sure everything is indeed in the right state.


This is what I've always done.

Databases (or cattle, how or what ever they may be) are always created as a separate stack in your IaC of choice. The values you need for this stack are always outputted for consumption into other pipelines.

The benefit being you can make lots of small changes to your upstream stacks without breaking your database. And you can place additional controls around the database stack to prevent instance replacement from occurring.


> Databases (or cattle, how or what ever they may be)

It's a reference to the idea that you should treat your servers as "cattle, not pets". In other words, they're disposable and you should be fine with destroying one and creating a new one to take its place.

The parent is saying that it's fine to treat databases more like pets, where they're pampered and looked after carefully, and fed treats - much like DBAs are, I believe.


I don't know much about Terraform, but at least in Pulumi you can mark resources as protected to prevent accidentally deleting them.



You can do it both in Terraform and also at the resource level in Azure itself.


That's chock full of bugs unfortunately. For example, if you lock an Azure DNS Zone to prevent it being deleted, you then cannot delete any DNS record under it! It's a strict hierarchy, there's no way to turn off the inheritance.

(It is possible to create a custom RBAC role that excludes zone deletion only, but this is very fiddly and not-quite-the-same in complex ways.)


An Azure resource group structure along those lines that I've come up with is:

    APP-LOC-SHARED       -- wildcard certs, deployment scripts, etc...
    APP-LOC-ENV-Common   -- gateways, etc...
    APP-LOC-ENV-Data     -- databases, storage accounts, pets.
    APP-LOC-ENV-Web      -- cattle
    APP-LOC-ENV-App      -- back-end cattle


DBs are cattle for dev environments. However, I agree they are not cattle for acc and prod.


One of the benefits of creating databases in terraform or some other IaC tool is having the DBA teams pre-define files to create databases along with the standard set of monitoring, logging, and other devops goodness. That way everyone gets the same set of “golden signals” monitoring and dashboarding for free.


All you have to do is set the final_snapshot_identifier and leave skip_final_snapshot off, right? Never done this, so I don’t really know.


Y'know, normally I'd be relatively forgiving as none of us are perfect, but jeesus this is a clusterf** of an article.

The author, a "Director of Consulting" might want to do some training (Both Hashicorp and Microsoft have free training).

Why in the hell would you have the same statefile for different environments? Why would you have these environments in the same Azure subscription? Why would you run your terraform so infrequently that you'd forget about a botched statefile move? Why would you not read TERRAFORM PLAN (It's LITERALLY WHAT TERRAFORM DOES)?

I also suspect that while the author's probably heard of a CI/CD pipeline, they're running their IaC from their local machine, given the tone of the article.

Why, for a production database that contains information not contained elsewhere, would you not configure an Azure Recovery Services Vault?

Like I get it, people make mistakes. I once did something similar with an overenthusiastic use of terraform destroy, but this guy just seems like an absolute cowboy who doesn't really know what he's doing.


The problem with Terraform and IAC is that there’s a big gap between learning how to use it and then learning how to use it in a safe, scalable way.

It’s the same with something like a programming language where there are thousands of best practices and foot guns, but infrastructure as code is more dangerous and much newer. There are also fewer books, venues and even training courses to learn these practices.

There are also a vanishingly small set of engineers who have actually done this in production at scale so it can be hard to find experienced people.


I've worked with at least several hundred of them over the past 5-10 years. Granted, I'm in Europe and the author is in the US, perhaps you are too. I'm not sure what talent/skills are like there.

In several firms I've worked in now someone missing so many things wouldn't be above consultant level, and wouldn't be approving PRs let alone deleting a live prod off from their local laptop without some significant questions being asked.

I get everyone must learn things, but when you position yourself as a technical expert (Director, in this case) you should have enough experience to be a bit more thorough with your work so if a mistake happens, there's a way out, or just not make what amounts to several design and implementation mistakes.

Part of what I think makes this a little egregious is the author didn't f** up his own systems, he f**'d his client's. I'd understand a little more if it were "I'm the in-house guy upskilling" rather than "look at this mistake I made as a (presumably) highly paid outside consultant literally brought in to make sure stuff like this doesn't happen".

However, I still must give absolute kudos for sharing mistakes publicly. We all do make mistakes, and most people try and hide it. When the author realised there was an issue, every step after that was handled like a pro.


Both Terragrunt[0] and Terraspace[1] address this by providing a way to manage different environments.

I do however think that Terraform should have a first class support for different environment without the need of a wrapper to facilitate that.

[0] https://terragrunt.gruntwork.io/

[1] https://terraspace.cloud/


Is what you're describing different from the idea of Terraform Workspaces?

https://www.terraform.io/docs/language/state/workspaces.html

> Certain backends support multiple named workspaces, allowing multiple states to be associated with a single configuration. The configuration still has only one backend, but multiple distinct instances of that configuration to be deployed without configuring a new backend or changing authentication credentials.


Normally I'd just read attacks on useful blog posts from anonymous accounts, but jeesus this is one cluster* of a comment.

You might want to actually read the entire blog post before going off on a tirade.

I am not OP btw.

> Why, for a production database that contains information not contained elsewhere

* This is a dev environment - the one that he uses to do dev work, where it's perfectly okay to corrupt things, which is why dev environments exist.

> Why would you run your terraform so infrequently that you'd forget about a botched statefile move?

* Why on earth is it surprising to you that there's a dev environment sitting somewhere that could be unused for some time?

> I also suspect that while the author's probably heard of a CI/CD pipeline, they're running their IaC from their local machine

* It is perfectly okay to develop and test changes on a local dev environment before putting them on a CI/CD pipeline. OP mentions that there is an Azure DevOps pipeline in place.

> this guy just seems like an absolute cowboy who doesn't really know what he's doing.

* In dev yes. He clearly does know what he is doing.

This of course would never happen with AWS and cloudformation changesets. OP is sharing gotchas that arise when using Azure and Terraform, this is useful.


Totally agree with you. And this makes me think about blogging about this kind of thing. On one side I understand it can be "liberating", and it also normalizes accepting that we are humans, we make mistakes, etc. But when it's such a basic error, I would personally feel really ashamed of it and would probably learn the lesson and hide it under a rock, instead of blogging about it and having it on the HN front page. There might be a few people/companies doing this level of TF bad practice out there that might find this helpful, but I also think that these kinds of companies don't have employees reading random tech blogs to learn things from others' mistakes.


Couple things:

1) It is so weird to me that every cloud provider deletes backups when you delete the SQL instance. We take offsite backups that are decoupled from this process. Fortunately someone made a shell script to do this that has worked quite well: https://github.com/ovotech/cloud_sql_backup (you can then just copy the Cloud Storage buckets to S3, so when your Google account gets banned you still have your database).

2) I hate to be "that guy", but I'm starting to wonder about gitops. I really, really like having all infrastructure changes recorded in machine-readable format with history. Pry it out of my cold, dead hands. But you also lose a ton of important tools, like the diff and the sanity check before you deploy. You could have a CI rule that does the diffs, but then CI has to touch production which is not ideal. You could have CI do a dry run of HEAD and a dry run against your PR, and diff those two, but honestly no CI systems really let you check out multiple branches and use them as an input to the script, so have fun hacking that up. You end up with workarounds for workarounds and the net result is that your auditable infrastructure comes at the cost of taking a human out of the loop (and slows down experimentation). I don't think the model is quite right quite yet.


I don't understand the problem. Use something like `terraform plan` as a comment on the pull request and manually look over the changes before pulling in.


I don't use Terraform but I think you're describing something like Atlantis? https://www.runatlantis.io/ I remember seeing a team running this at a previous company I worked at


Atlantis is just what the GP mentioned: a CI that runs terraform plan, after which a person needs to examine the dry-run output. But it's still not ideal because Atlantis also has permissions on production, and a (human) mistake could bring down your infrastructure.


There is also documentation on how to integrate Terraform with Gitlab Merge Requests https://docs.gitlab.com/ee/user/infrastructure/mr_integratio...


Not even before pulling in - the state may change in between the comment and the applying. A safer way is to dump the plan from after the merge and not apply that specific plan until after it's manually approved.


Yeah, terraform cloud from hashicorp does this


> 1) It is so weird to me that every cloud provider deletes backups when you delete the SQL instance.

Not actually true for Google Cloud CloudSQL (at least MySQL). You can delete the instance and you don't see any backups any more in the Google Cloud Console, but they are actually there and you can restore from them. You need to know the instance name though.

I think the backups will be kept for 30 days. This is the same time you can't reuse the instance name.

Google could improve the UI about this a lot and display also deleted instances.


Is this documented anywhere? Or does it just happen to work and may break in the future?


Isn't the answer to 2 to have a pre-production environment to calculate the diffs/changesets/etc. against, or even to deploy the change to in order to test it? I know CloudFormation, for all its faults, is pretty good at making a hypothetical changeset before it runs, and if that's not safe enough, you can just deploy it to your dev environment and use the changes succeeding and your service being up as an integration test.


Sounds like you have no test or staging environment!

Just apply the changes against staging. Test everything is kosher. Then merge into master to apply to prod.


> honestly no CI systems really let you check out multiple branches and use them as an input to the script, so have fun hacking that up.

Doesn't seem hard? It would be a one-liner "git worktree" pre-build step in Jenkins, for example.


What is unideal about CI (actually CD) touching prod? In my view what’s unideal is humans touching prod. Most problems with “gitops” IME come from git itself, which is a terrible VCS for just about any project except the kernel, and from the fact that tooling is not fully idempotent or atomic (e.g. terraform applies failing midway).


One of the big mistakes I see here is that the testing environment A doesn't look and work the same as the prod environment B. The closer both are, the surer you can be that you can catch such mistakes by testing in A. The bigger the differences, the more problems can sneak in undetected.

So while I think it is OK for the author to be a little humbled by his mistake, I would actually place the blame on the misdesigned environments. We should try to expect human mistakes and make them avoidable (by noticing them early in the test environment) or prevent them altogether (by automatically testing and only then allowing the change to production, but that is far harder to set up).


"One of the big mistakes I see here is that the testing environment A doesn't look and work the same as the prod environment B"

This is a nice aspirational goal, one that you should definitely aim for, but I have never seen a situation in a non-trivial app where the test and prod environment are even close to identical.

All large apps I've worked on end up with external dependencies that themselves don't have a test system and have real world side effects (like buying things), so straight off the bat, all of them need to be mocked out.

Then there's no way we'd get budget for a 1000+ node cluster on our test system, so naturally there are entire classes of high load / large network bugs that are less likely to crop up.

Finally the test systems are often not allowed to handle prod data for regulatory reasons. Thus, everything needs to be either synthetically generated or stripped versions of prod data, which unsurprisingly sometimes behaves rather differently.

Even if you think your test environment is a pretty good replica of prod, chances are it's probably not as good as you think. So you should always have processes in place to be able to slowly rollout deployments and halt/rollback if there are issues. Everything should always be backed up. And importantly, this should not all be controlled by a single deploy script with global permissions - anything that deletes backups should need explicit multi-step approval.


Indeed. Duplicating Test and Prod can be tens of thousands of dollars (or more) a month.

You have to invest big time in automation to build up and tear down any decent sized environment to save the spend there, and that itself costs money.


Over decades I've slowly convinced myself that the best approach is to split production into groups and do rolling updates.

One of the smoothest run environments I've ever worked with had ten silos. A failed update would not impact more than 10% of the users, and we could quickly redirect them to the remaining 90% of the platform without a material performance impact.


That definitely is an option, but viability depends on what your environment is doing: "Sorry, we lost 10% of your bank transfers due to an update" just won't be acceptable, but "we delivered 10% less catpics today" might be.

And I wonder how many platforms can be made to work like that.


Sure, they will be different in scale. But can't you keep them synched in all the ways that matter for TFA? Like your clusters will be different sizes but will have the same resource definitions, etc.


After working with declarative systems for some time now (terraform, kubernetes), I've concluded that things will generally be declarative, but you're bound to run into edge cases where the imperative nature of the underlying system bleeds through and shows up as a gotcha for some reason or other. For kubernetes, I've found that people tend to have to name YAML files in the order in which they should be applied for things to work properly. A more devious case involves kubernetes ConfigMaps getting updated in place via apply while the deployment fails to recognize that the environment variables for the pod have now changed and need to be restated. I suppose this could be chalked up to the idea that all declarative systems need to be functional and each resource should be immutable, however, that's never how it evolves in practice. In short, it's all been quite a disillusioning journey through the general promise of not needing to worry about ordering that eventually results in ordering being the hardest problem that the system faces.


Where I last worked, all terraform changes went through PR, requiring approval, after having read the plan. It was using a system called atlantis. It was slow, but it prevented issues like this.


Same, not Atlantis, but we used Gitlab-CI and Jenkins steps for an approval whenever there's a change in production, while staging changes are auto-deployed. The Terraform plan was written to the PRs using tfnotify[0]. Normal deployments typically took 1 minute and 20 seconds (for each environment, in parallel), which I would consider very reasonable considering that we deployed a medium-size infrastructure with only 2 terraform layers, so there was room for optimization.

[0]: https://github.com/mercari/tfnotify


This is the way.

(we use gitlab-ci's built-in review process for approval).

At the end of the day, approval's still a human job, and humans make mistakes. Right? :D

Terraform is an incredibly powerful tool, and you can make some monumentally huge mistakes with it.


Atlantis was actually created at my previous workplace by a couple of my ex-coworkers! Agreed that it’s a great way to bring a bit more care/rigour to always-dangerous infrastructure changes. IIRC we had it configured so that you always had to do things in this order:

- plan against staging

- get a PR approval

- apply against staging

- plan against prod

- apply against prod

- merge

Being forced to plan (and get someone to review said plan) before applying makes it far, far less likely you’ll do the level of damage described in this blog post.


From what I've experienced, per-environment branches are a bad practice that will eventually become a big burden to deal with, especially when environments don't match. Actually the concept of "staging" in infrastructure is different from what it is in code, which is the usual source of confusion.

The best strategy is to have a repository for your modules only so you can specify the version[0] you want to use, and separate environments by folders.

[0]: https://www.terraform.io/docs/language/modules/sources.html#...


Yeah, we just had a single feature branch, which we would merge into the single master branch. We’d simply apply it to staging first, make sure nothing terrible happened, then apply to master. All those steps I listed above happened on the same branch, same PR.


Atlantis is great. If you've grown beyond 5 or so engineers you should have no excuse to be running terraform apply from laptops.


It is completely embarrassing how many engineers we have and still apply manually from laptops. Changes are slow and error-prone, we don't even have them hooked up to CI/CD. I think it still works because we have so many damn engineers and we don't actually need to change infrastructure multiple times a day.

That said, Terraform breaks so often that if we did it all automated, we'd have a million more Git commits from trying to fix broken applies.


Well, Azure prevented you from creating a RG but still let you delete it because that was the role that the client configured for you. AFAIK there's no built-in role that behaves like this, but it's probably a very easy mistake to make - grant permissions on resourcegroups/$name/* , forgetting that * includes delete.

re: SQL Server backups, I assume this was VMs running SQL Server rather than Azure SQL DB (the managed one)? If it's the latter, then I think the backups will be retained even if the RG is deleted.


I mostly work with AWS, but have used Azure in the past and remember the Resource Group concept. It is quite dangerous when coupled with Terraform (or any IaC tooling for that matter).

AWS RDS has a feature where if you delete a DB instance it prompts you to take a final snapshot of the data. I haven't used Terraform in just over a year now (have been using aws-cdk), but as far as I remember, Terraform deletes of RDS instances would require you to add a 'force' argument/option to override the prompt for the final snapshot.

Something like this seems like a no-brainer for Azure SQL when deleting instances.

Also, with TF it's always best to plan, write the plan output to a file, and then run apply against that plan (after having studied it!).


Azure’s support is great! I am not surprised they quickly helped out.

First time I contacted them I was not hoping for much. But after a few contacts I realize they must be one of the best support orgs in the world.

They can quickly help out with anything from very trivial questions to helping out with very difficult faults, and do this without having to be harassed about escalating my tickets. Outstanding.


I've had the exact opposite experience trying to get support setting up a VPN to connect an on-prem server with remote clients (using self-signed certs and terraform). I'm not very familiar with Azure and I was volunteering my time/expertise but the support (paid but basic plan, I think $100/mo) was very underwhelming. The support technicians were all Indian and while they were polite and had good English, the time difference made the back-and-forth painful.

Spent over a month trying to resolve the issue but I must've been talking with tier 1 support because it was obvious the support technician wasn't very technical (asking very basic questions that I had already answered, couldn't answer any of my questions without getting back to me days later). Combined with the quirks and performance issues (30+ mins to setup VPN gateway) it was a very poor experience and I wouldn't choose to work with Azure in the future.


Doesn’t terraform show you how many resources are affected and require you to approve that when running apply?


What happened is OP imported an AzureRM resource group into a terraform resource, then "deleted" the resource by changing to a data block. When you delete a resource group in AzureRM it deletes every resource under it. This behavior is pretty stupid when mixed with terraform but as a rule of thumb if I'm not importing every resource under a resource group into the state then I make it a data call from the start. The missing step was to `terraform state rm <addr>` on the resource group and then change it to a data block.
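The order of operations described above can be written down as a small checklist. This is only a sketch in Python: the address `azurerm_resource_group.example` is a made-up placeholder and the helper is hypothetical, while `terraform state rm` and `terraform plan` are the real subcommands.

```python
from typing import List, Optional, Tuple

# Each step is (description, command to run or None for a manual edit).
Step = Tuple[str, Optional[List[str]]]


def forget_resource_steps(address: str) -> List[Step]:
    """Ordered steps to stop managing a resource without destroying it."""
    return [
        # 1. Drop the resource from state only; the real resource is untouched.
        ("remove from state", ["terraform", "state", "rm", address]),
        # 2. Manual edit: swap the `resource` block for a `data` block.
        ("change resource block to data block", None),
        # 3. Plan again and confirm that no destroy of `address` is proposed.
        ("verify with plan", ["terraform", "plan"]),
    ]


if __name__ == "__main__":
    # Hypothetical address, used here only to print the checklist.
    for description, command in forget_resource_steps("azurerm_resource_group.example"):
        print(description, "->", " ".join(command) if command else "(manual edit)")
```

Doing step 2 before step 1 is what produces a plan that destroys the resource group, so the order is the whole point of the sketch.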


Sounds to me like a bug in the Azure provider. Its job is supposed to be understanding the environment and knowing what will happen for any given action.


AFAIK terraform doesn't have a way to inject user interaction on provider-level operations.

Anyone know if az cli or powershell module prompts users before it deletes a resource group?


I didn't mean injecting an interaction.

What it is supposed to do is, when it is about to remove a resource group, see that other resources still rely on it and not remove it.


It does, but according to the post it was run via Azure DevOps. I'm not familiar with ADO but it sounds like it might not have been as obvious as running locally. Alternatively, perhaps it would have only shown deleting the Resource Group, but internally to Azure (and unbeknownst to Terraform) this means deleting everything within the RG as well.


We run our Terraform pipeline via Azure DevOps. We separate the plan/apply steps, stash both the binary and JSON versions of the plan, stash the logs of the plan, and require increasing levels of approval as it moves through each environment on its way from dev to prod. To go to prod it needs 4 people to have reviewed the plan - the person whose PR triggered the release, a director, a team lead, and another engineer. This is enforced using Azure DevOps' baked-in pipelines and approval functionalities. To keep from nuking things we completely disable destructive operations for our AWS/Azure Terraform user and in fact limit what the user can do as much as possible, only giving it the permissions needed to create what we use, and adding as necessary. For deletes we add a very targeted permission that we revoke after deployment.

It can seem onerous at times, but ultimately it's a far smoother process than legacy manual or scripted deployments and is very repeatable and visible. It's also a good thing to have lots of eyes on critical infrastructure changes anyway.
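One thing a stashed JSON plan makes easy is flagging destructive changes for reviewers. A rough Python sketch, assuming the JSON was produced with `terraform show -json <planfile>` and follows Terraform's documented plan representation (`resource_changes[].change.actions`). The function name and the sample addresses are invented for illustration, and this is not necessarily how the pipeline above does it.

```python
import json
from typing import List


def destructive_addresses(plan_json: str) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create" in one change.
        if "delete" in actions:
            flagged.append(resource_change["address"])
    return flagged


# Hand-written sample with hypothetical addresses, trimmed to the fields used.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "azurerm_resource_group.main", "change": {"actions": ["delete"]}},
        {"address": "azurerm_storage_account.logs", "change": {"actions": ["update"]}},
        {"address": "azurerm_mssql_database.app", "change": {"actions": ["delete", "create"]}},
        {"address": "azurerm_dns_zone.public", "change": {"actions": ["no-op"]}},
    ]
})

if __name__ == "__main__":
    for address in destructive_addresses(SAMPLE_PLAN):
        print("DESTRUCTIVE:", address)
```

A pipeline could fail the stage, or demand an extra approval, whenever the returned list is non-empty. Note that a provider-side cascade, such as a resource group delete removing everything inside it, still appears as a single delete here.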


Why stop at 4 people though, and only director level?


I realize you're being sarcastic but we're contractually and legally obligated to meet that.


.. unless you add the `-auto-approve` flag.


I'm not sure about Terraform, but Pulumi has an options object that can set parameters about how Pulumi interacts with that resource. I tend to enable `protect` on database resources which prevents Pulumi from deleting it.

AWS has done a great job at making it difficult-to-impossible to tear everything down in a single command. There are rate limits on resource deletion and no single place to remove items in bulk if not managed by Cloud Formation. I spent the better part of a couple days trying to clean up an account from all of the random supporting services that get created over the years. If I'm wrong about this, please let me know!


Terraform supports this by setting prevent_destroy = true in a resource's lifecycle, which I discovered after I accidentally vanished an entire Elasticsearch domain—fortunately, though, not prod.


It sounds like the fundamental problem here is that the account credentials for environments are shared. You shouldn't have permission from one environment to modify the assets of another. Every environment should have its own segregated account ID. Then you can't import the wrong resources. The state files should be segregated as well, so that when you're running Terraform on Account A, it can't even see the state file for Account B, much less mix them up or compare them to the wrong environment.

It sounds like the user running Terraform had access rights to multiple environments. This sort of thing was inevitable.


I keep databases and other single data source constructs out of terraform. Sure, there are mechanisms to prevent deletion but all it takes is one bug (and terraform has many) for all your backups to tear down.

I also infrequently back up single sources off cloud to prevent ransomware payoffs. It has the secondary benefit of providing a recovery source if things go off a cliff in the production environment.


>I keep databases and other single data source constructs out of terraform

Glad I'm not the only one that saw this as the right choice.


> but the sheer magnitude of carnage resulting from a one word change in terraform was surprising

You are giving a relatively new tool full access to your datacenter. Here be dragons. This mistake is obviously something that would never happen when using the official UI, at least not without descriptive explanations and warnings in the process.


This wouldn't have happened with a separate state file. Additionally, as said below, plan + apply must be separated with a ManualValidation task to allow release administrators to manually inspect the plan output on a production deployment.

Terraform's "-detailed-exitcode" will help you tailor the pipeline if there's nothing to do (0 - plan success, no changes, 1 - error, 2 - plan success, changes).
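
A minimal sketch of how a pipeline step could branch on those exit codes. The function names and the "skip"/"review"/"fail" labels are made up for illustration; only the `-detailed-exitcode` and `-out` flags and the 0/1/2 meanings come from Terraform itself.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a decision.

    0 -> plan succeeded, no changes: nothing to apply
    2 -> plan succeeded, changes present: require manual review
    anything else (1 is an error) -> fail the pipeline
    """
    if code == 0:
        return "skip"
    if code == 2:
        return "review"
    return "fail"


def run_plan(workdir: str, plan_path: str = "tfplan") -> str:
    """Run the plan in `workdir`, save it to a file, and return the decision.

    Assumes the terraform binary is on PATH and `terraform init` has
    already been run in `workdir`.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=" + plan_path],
        cwd=workdir,
    )
    return interpret_plan_exit_code(result.returncode)
```

The saved plan file is then the artifact handed to the apply step after approval, so what gets applied is what was reviewed.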

I'm also of the opinion that prod/non-prod subscription and resource groups pairs should be "vended" to you by another department (i.e. you're a user not an owner). This prevents you accidentally destroying the entire hierarchy.


Blaming Azure here feels quite disingenuous when there are so many layers of breakage in the approach (manual apply, TF state across subs, a hack to remap a resource, assumptions about how databases are handled, etc.)


Not entirely related but I'd also recommend thinking hard about your resource groups in case you ever need to move or re-create them. Due to the way Azure and Terraform work, it's "safe" to put all your stateless resources in one resource group, but everything stateful should be in individual, small resource groups. This will be a life saver if you ever want or need to move your stack across regions using Azure's built-in georeplication rather than restoring from slow off-site backups.


> but everything stateless should be in individual, small resource groups

I think you meant stateful.


Thanks, edited!


First, I recommend having a different subscription for each environment. Second, the thing that is really nice with Terraform is that you can see what it plans to do before it does it. One can also set it up in the CI/CD pipelines to align it with other change processes.


> I will cut corners and do a terraform plan on just one of these environments and then terraform apply on all of the others.

We don't do this because preproduction is actively used by the business. It is treated like a production environment. But it depends on the company.


We had the same issue with DynamoDB: Terraform is all too happy to delete a table it deems acceptable to delete, which in 99 out of 100 cases it is not. It is a very aggressive IaC tool.


It's a very powerful tool. Just like any other (well designed) tool, it's only going to do what the operator tells it to do.

Terraform will never delete something you don't tell it to.


> Terraform will never delete something you don't tell it to.

Not necessarily. With some providers (e.g. Azure), Terraform will fail to recognize automated behind-the-scenes changes and try to revert them, causing serious breakage. This is why the "ignore_changes" meta-argument exists. See https://itnext.io/how-and-when-to-ignore-lifecycle-changes-i...


Instances like this are where CDK is superior to Terraform imho. There are many resources which are difficult (or impossible) to delete or modify once created with CDK, and for good reason. Secrets also fall into this category.


We also ran into issues with the CDK: it errors out if the resource already exists and was not created with the CDK. We ended up using Ansible for DynamoDB, though Ansible can't deploy to LocalStack. So it seems that we are still in a world where you have to use a couple of different IaC tools depending on the needs.


What's the use-case for creating resources outside of--but also defining them in--CDK in this case? That's generally an anti-pattern.


It's actually not an anti-pattern. CDK provides many methods (e.g. `StringParameter.valueForStringParameter`) for accessing deployed resources from other stacks, or those deployed by other tools or manually. IaC can get in the way of that, but it's not forbidden, nor discouraged.


You could tag everything you care about with lifecycle rules to not delete but it'd be nicer if they were the default for certain resources.


All I can say after using all the various systems is that Borg/google3 is still the best resource manager I have ever used. It's not impossible to make mistakes there, but a number of very powerful things follow from the monorepo and deploy-at-HEAD approach that make it possible for many teams to provision resources within a limited number of accounts.


"immediately raise a ticket with Azure Support, who were able to grab the resources from “somewhere” (I guess when you delete a resource in Azure, it’s still on a disk somewhere, for a while), and we got our database and backups back"

What the hell is Azure doing pretending to delete things?

How long does Azure keep your data after you think you've deleted it?

Is there a way to ensure data stored in Azure is actually destroyed when you ask for it to be destroyed?


I can't speak to any specifics, but asynchronous behavior, including deletions, is very common in large distributed systems, and Azure I'd wager is no exception.

Update: I'd forgotten that even filesystems often do a soft delete. https://lwn.net/Articles/462437/


A bunch of high-profile resources have temporary soft-deletes - resource groups, managed SQL DBs, KeyVaults, Storage Accounts, and even subscriptions themselves. For some of these the undelete option is given to the user, for others you have to call support.


ah ok, phew! I guess the OP just didn't know about that - they seemed surprised it was recoverable!


Well it's not documented for resource groups, so their surprise is expected. The fact that it applies to resource groups is based on empirical evidence.


It's also not documented b/c Azure and AWS don't want to give promises about this being possible.


Tables in Azure Storage are an example of this -- the delete is performed asynchronously by a background thread some time later (just garbage collection). The table is not immediately deleted. I don't think you can force Azure to immediately delete it, but you could possibly raise a support ticket to ask them to delete it immediately. Have not tried this though.

Part of the reason is performance, I would assume, since a large delete could be hard on the overall system, and could slow Azure down for you + other customers. Also because some customers accidentally delete critical resources sometimes. (Or forgot to copy down any important config options from the resource before deleting it. I have some experience with that mistake.)


Encrypt it and lose the key is generally effective.


It is destroyed. But I believe there is a window during which Azure support can recover it for you until the deletion is final.


You should have a look at Terraform's workspace feature.


Workspaces are specifically _not_ intended to be used as environment replacements. And they wouldn't have caught the OP's issue to begin with. His problem was that he didn't read the plan: it's pretty basic that any change of a resource name or type equates to deletion, and TF will report that it will destroy a resource that is in the state but missing from the config.

To solve his problem, after switching to the data source type from resource he needed to manually run `terraform state rm` to get rid of the resource from the state.
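
To make the reasoning above concrete, here is a deliberately simplified toy model, not how Terraform is implemented: anything tracked in state whose address no longer appears in the config is planned for destruction. The addresses are hypothetical. The one real detail it leans on is that a data source has a different address (`data.` prefix) than the managed resource it replaced.

```python
from typing import Iterable, List


def planned_destroys(state: Iterable[str], config: Iterable[str]) -> List[str]:
    """Toy model: addresses present in state but absent from the config."""
    return sorted(set(state) - set(config))


# Hypothetical state before the change: both resources are managed.
state = ["azurerm_resource_group.main", "azurerm_sql_database.db"]

# Config after switching the resource group from a resource to a data source.
config = ["data.azurerm_resource_group.main", "azurerm_sql_database.db"]

# The managed address is gone from the config, so it is slated for destroy.
before_state_rm = planned_destroys(state, config)

# Equivalent of `terraform state rm azurerm_resource_group.main`:
# Terraform forgets the object without touching the real infrastructure.
state_after_rm = [a for a in state if a != "azurerm_resource_group.main"]
after_state_rm = planned_destroys(state_after_rm, config)

if __name__ == "__main__":
    print("before state rm:", before_state_rm)
    print("after state rm:", after_state_rm)
```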


> workspaces are specifically _not_ intended to be used as environment replacements.

From Terraform's "An Overview of Our Recommended Workflow"[1]:

  "The best approach is to use one workspace for each environment of a given infrastructure component. Or in other words, Terraform configurations * environments = workspaces."

So 1 workspace != 1 environment but workspaces are indeed intended to handle multiple environments.

I've personally struggled with managing multiple environments using Terraform so I'm interested in what the best practices are.

[1]: https://www.terraform.io/docs/cloud/guides/recommended-pract...


Can you elaborate on why terraform workspaces are not intended to be used for environments? I use it exactly for this use case. Same terraform config deploying to production and staging, each maintaining its own state in a separate workspace.


There's a song in Hamilton that I'm always reminded of when terraform applies run:

"Blow Us All Away"



