Where I last worked, all Terraform changes went through a PR that required approval from someone who had read the plan. It used a system called Atlantis. It was slow, but it prevented issues like this.
Same here. Not Atlantis, but we used GitLab CI and Jenkins steps that required approval whenever there was a change in production, while staging changes were auto-deployed. The Terraform plan was written to the PRs using tfnotify[0]. Normal deployments typically took 1 minute and 20 seconds (for each environment, in parallel), which I would consider very reasonable given that we deployed a medium-sized infrastructure with only 2 Terraform layers, so there was room for optimization.
Atlantis was actually created at my previous workplace by a couple of my ex-coworkers! Agreed that it’s a great way to bring a bit more care/rigour to always-dangerous infrastructure changes. IIRC we had it configured so that you always had to do things in this order:
- plan against staging
- get a PR approval
- apply against staging
- plan against prod
- apply against prod
- merge
Being forced to plan (and get someone to review said plan) before applying makes it far, far less likely you’ll do the level of damage described in this blog post.
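To make the ordering listed above concrete, here is a minimal Python sketch of a gate that rejects any step attempted out of sequence. This is not Atlantis's actual configuration or API; the `Step`, `PullRequestWorkflow`, and `OutOfOrderError` names are hypothetical and only illustrate the idea.

```python
from enum import Enum


class Step(Enum):
    # Declaration order is the required order of the workflow.
    PLAN_STAGING = "plan against staging"
    APPROVE = "get a PR approval"
    APPLY_STAGING = "apply against staging"
    PLAN_PROD = "plan against prod"
    APPLY_PROD = "apply against prod"
    MERGE = "merge"


ORDER = list(Step)


class OutOfOrderError(Exception):
    """Raised when a step is attempted before its predecessors have run."""


class PullRequestWorkflow:
    """Hypothetical tracker: records completed steps for one PR."""

    def __init__(self):
        self.completed = []

    def run(self, step):
        # The only step allowed next is the one following the last completed step.
        done = len(self.completed)
        expected = ORDER[done] if done < len(ORDER) else None
        if step is not expected:
            wanted = expected.value if expected else "nothing (workflow finished)"
            raise OutOfOrderError(
                f"cannot '{step.value}' yet; next allowed step is: {wanted}"
            )
        self.completed.append(step)
```

With this, calling `run(Step.APPLY_PROD)` on a fresh PR raises `OutOfOrderError`, which is the property that matters: nobody gets to apply before a plan exists and has been approved.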
From what I've experienced, per-environment branches are a bad practice that eventually becomes a big burden to deal with, especially when environments don't match. Actually, the concept of "staging" in infrastructure is different from what it is in code, which is the usual source of confusion.
The best strategy is to have a repository for your modules only so you can specify the version[0] you want to use, and separate environments by folders.
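As a rough illustration of the "pin a module version per environment folder" idea, here is a small Python sketch that flags git-based module sources lacking a `?ref=` pin. The folder layout, the example URL, and the `unpinned_git_sources` helper are all assumptions made for illustration, and it only covers git sources, not registry modules that use a separate `version` argument.

```python
import re

# Assumed layout (hypothetical names):
#   infra-modules/            <- separate repo holding only modules, tagged v1.4.0 etc.
#   live/staging/main.tf      <- one folder per environment
#   live/prod/main.tf

# Matches a Terraform module `source = "..."` attribute and captures its value.
SOURCE_RE = re.compile(r'^\s*source\s*=\s*"([^"]+)"', re.MULTILINE)


def unpinned_git_sources(hcl_text):
    """Return git module sources in hcl_text that have no ?ref=... pin."""
    sources = SOURCE_RE.findall(hcl_text)
    return [s for s in sources if s.startswith("git::") and "ref=" not in s]


# Example contents of a hypothetical live/staging/main.tf.
STAGING_MAIN_TF = '''
module "network" {
  source = "git::https://example.com/infra-modules.git//network?ref=v1.4.0"
}

module "database" {
  source = "git::https://example.com/infra-modules.git//database"
}
'''

if __name__ == "__main__":
    for source in unpinned_git_sources(STAGING_MAIN_TF):
        print(f"unpinned module source: {source}")
```

Running a check like this per environment folder lets staging move to a new module tag first while prod stays on the old one, without needing a branch per environment.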
Yeah, we just had a single feature branch, which we would merge into the single master branch. We’d simply apply it to staging first, make sure nothing terrible happened, then apply to prod. All those steps I listed above happened on the same branch, same PR.
It is completely embarrassing how many engineers we have while still applying manually from laptops. Changes are slow and error-prone, and we don't even have them hooked up to CI/CD. I think it still works because we have so many damn engineers and we don't actually need to change infrastructure multiple times a day.
That said, Terraform breaks so often that if we did it all automated, we'd have a million more Git commits from trying to fix broken applies.