Terraform at Scale (medium.com/faun)
83 points by mooreds on Oct 15, 2020 | 39 comments



This is frighteningly complex, and I would not recommend it for reliability reasons. Terraform makes it quite difficult to maintain a structure like this, mostly due to the inability to interpolate module versions.

I highly recommend anyone using Terraform start with a monorepo. Still use modules and spread out your components, but definitely don't split them into separate repos: keeping the version graph together becomes a real pain and is easy to mismanage.

From there, ensure any logical components have their own separate lock and state file. If you don't, it becomes a game of roulette to know when a resource was last run and what might change if you roll it out. It is easy to build your own directed graph between components using data sources: they allow component B to read component A's state file to get the outputs it needs.
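
For reference, a minimal sketch of that pattern, assuming an S3 backend (bucket, key, and resource names here are hypothetical):

    # Component B reads component A's state file as a data source:
    data "terraform_remote_state" "component_a" {
      backend = "s3"
      config = {
        bucket = "example-terraform-states"
        key    = "component-a/terraform.tfstate"
        region = "us-east-1"
      }
    }

    # ...and consumes its outputs:
    resource "aws_instance" "b" {
      ami           = "ami-12345678"  # placeholder
      instance_type = "t3.micro"
      subnet_id     = data.terraform_remote_state.component_a.outputs.subnet_id
    }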

All in all, I've found most of the documentation and "best practice" in the Terraform community promotes dangerous practices. I talk more about these practices in this podcast [0].

[0]: https://packetpushers.net/podcast/full-stack-journey-027-und...


Having worked with Terraform for 3 years, I totally agree!

Hashicorp should put more money into promoting "healthy" terraform tutorials based on real world usage (maybe split for small/medium/large orgs)

My first setup was split into 10-15 repos and it was a nightmare...

Now I have a mono-repo (and some GitlabCI magic to handle different projects) + Terragrunt, and it is so much more stable than my first setup! As with everything, start simple and only change if you hit a wall!


Do you mind sharing your strategy for module versioning in a mono-repo? Anything you encountered that really didn't work?


I agree as well. Hell, even the official VS Code extension doesn't support multi-folder workspaces.


I used to love Ansible/TF control of infrastructure. But honestly, after 2 years of Ansible, we considered switching to TF but ended up with SAM (an AWS deploy tool on top of CloudFormation).

The only benefit I saw with Ansible/TF was controlling things outside AWS... like running shell scripts or connecting to servers.

Since we swapped to SAM, we no longer get breaking changes from minor patches... Features are available almost instantly, and we can easily modularize and deploy in parallel (we often do 30+ simultaneous deploys without issue).

I still don't claim TF or Ansible are useless, just that they fix a problem I don't have... I'm not multi-cloud... and all the BS that was sold about these tools transitioning easily from one cloud to another is junk. You can't just pick up a Lambda, change three lines, and drop it into Azure.


> Features are available almost instantly

If SAM relies on CloudFormation, there must have been some serious changes at AWS. CF was/is notorious for lagging behind the AWS console and API in terms of feature coverage.

The Terraform AWS provider has always been much faster to reflect the AWS API changes.


In agreement with you, SAM's support in CF and CDK is solid. AWS has really put its weight behind SAM.

The place that suffers is ElasticBeanstalk. I used to be on EB but couldn't take it anymore. I kept having to run mini-templates from inside the beanstalk machines to get access to specific items that weren't exposed to CF directly.

Once I moved to ECS/Fargate, support is worlds better. ECS still has some parts that are laggy in the CDK (though you can still do it via a JSON object), but not in CF. It's been pretty good here in Fargate land.


If I recall correctly, a couple of years ago AWS switched from one central team owning all of the CFN implementations to pushing it onto the service owners, so it's no longer bottlenecked on a single team's queue. Given that the serverless teams are all invested in SAM, it follows they'd make sure those things are implemented.


Not especially true for all AWS services; there have been some that were only available in CF (WorkSpaces comes to mind).

But in this case it wasn't Hashicorp's fault, as what was available in CF wasn't in the official AWS API.

It took something like a year for AWS to release the API


You might want to check out the Serverless Framework [0], the inspiration for SAM, which adds variables and a lot of flexibility.

[0] https://www.serverless.com


My setup is quite a bit simpler. I have created different environments based on Terragrunt, e.g. to separate prod and non-prod environments. Then I have some "macro modules" with separate state files.

I try to avoid too many state files, so for a medium-sized setup on AWS I have:

- Network module

- IAM/security module (which is region-free)

- Application modules (for instance, one per topic: my Docker cluster, my S3 buckets, and so on)

One macro module can load another one's state outputs, so for instance the network module defines the default VPC you will likely use in the others.

Secrets are variables you must pass via TF_VAR_* environment variables during a manual apply, so no password is stored in the Terraform files.
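
A minimal sketch of that, using a hypothetical `db_password` variable:

    variable "db_password" {
      type = string
    }

    # Supplied via the environment at apply time, never committed:
    #   export TF_VAR_db_password='...'
    #   terraform apply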


I wish "Separation of Concerns" applied (cleanly), but at least in my personal experience, it doesn't, as AWS services end up being widely coupled - nobody's at fault here, I believe it's just the nature of the systems.

On a tangent, something that would improve TF teamwork is to split state locks into read/write; right now, two read requests (i.e. `plan` operations) can't run at the same time on the same configuration.
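
A partial workaround (not a fix for the underlying design) is to skip the state lock for read-only plans, accepting the risk of planning against a state that another run is mid-way through changing:

    terraform plan -lock=false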


If you are having issues with multiple concurrent state reads you should definitely consider splitting your project into smaller discrete modules. I see a lot of projects where an entire AWS account is managed by Terraform using a single state file. This is insanity in my opinion.

It does simplify the architecture in a way, because everything can be referenced directly in the same project, but you quickly run into collaboration issues and runtimes get really long. Use logical states for each discrete piece of a service/module and have those output the needed config/info. Then reference those outputs from other states/modules.

This is the only way I've seen to really scale a TF project effectively. It's not perfect but it's workable even for large teams.


If you split you get the issue that you can't reference other resources easily.

Collaboration should be a non-issue; the few people who manage the infrastructure should agree on the work and plan together.

Unlike software code that can be merged later at no risk (a merge conflict at worst), infrastructure does not lend itself to improvising and experimenting on the fly (try to accidentally change your VPC or delete your databases and see how the company goes).

Terraform is something that really highlights when there are multiple people/groups in an organization trying to steer the ship in different directions.

In my opinion, Ansible/Salt do a much better job for often-changed resources like instances or services, because they don't need to maintain state; they can look up and match existing resources (mostly by tags). Terraform is better for fixed resources like VPC/networking.


You can very easily reference resources by using an AWS data source (gets resource info directly from the AWS APIs) or using a terraform remote state data source (gets info from the relevant terraform state file outputs).

Another way is to use a KV store, like AWS SSM Parameter Store or Consul. This is useful for linking your infra to your CI/CD, or even directly to your application.
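
For example (names, tags, and parameter paths here are illustrative only):

    # Look up an existing VPC directly from the AWS API by tag:
    data "aws_vpc" "shared" {
      tags = {
        Name = "shared-vpc"
      }
    }

    # Or read a value another pipeline published to SSM Parameter Store:
    data "aws_ssm_parameter" "db_endpoint" {
      name = "/platform/prod/db_endpoint"
    }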


You can reference external resources by name and not bring the resource into the Terraform domain (or manage it in a different Terraform project). I have not had a problem running multiple Terraform projects and sharing a few resources (e.g. a SQL database).


> If you split you get the issue that you can't reference other resources easily

You can; it's just non-obvious. What you do is output identifiers in one module and then read that state file as a data source in another.
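
Roughly, the producing side is just an output block (names hypothetical); the consuming side then uses a `terraform_remote_state` data source as described upthread:

    output "subnet_id" {
      value       = aws_subnet.main.id
      description = "Consumed by downstream components via terraform_remote_state"
    }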


It is really frustrating that this isn't obvious. One of the best features of terraform hidden by poor/unreliable practices and "getting started" guides.


There are multiple ways to reference other resources including by state or as a data source. But Hashicorp/Terraform don't do a good job of making this clear.


We do have multiple discrete modules indeed (each with its state file), but multiple devs can still work on one at the same time.

I think separating configurations is important; however, it also comes with downsides (which, again, don't detract from the value of separation), such as making it necessary to use remote states and data sources in order to reference resources in other configurations.


Yes, there is no magic solution and all approaches have downsides. I tend to lean towards more separation, at the cost of increased reference complexity.


One thing I would like Hashicorp to do is take pull requests more seriously: address them within a certain time period and help new Go engineers with tips on how to finish their PRs for Terraform providers. The AWS provider is an example of lack of engagement leading to stale PRs. At the same time, Hashicorp loves to shout about day-0 support for big cloud features.


I have experience with contributing to both Terraform core, and a provider.

The core community maintenance is one of the most responsive and efficient that I've ever worked with. So if they love to shout about day-0 support, I think they've earned the right.

The provider project I've contributed to, or rather tried to contribute to, has been a complete trainwreck: many devs discussing an issue, maintainers ignoring it for a long time, then a maintainer stepping in with a fundamentally incompetent comment, then disappearing again. This is odd, because the provider is maintained by Hashicorp employees.

There are polite and communicative ways of explaining why it's not possible to work on a fix/feature; that maintenance was far from this.

I suppose Hashicorp has a deliberate plan of maintaining the core very actively, at the cost of doing very poor maintenance on some providers. I wouldn't be surprised if they put their less important employees to work on the providers.


I concur that provider projects are a mess. When I've reported issues, I've been told "That's not an issue, just do your plan/apply manually in two steps" which is pretty contrary to what you should need to do with Terraform. So many provider GitHub repos are set to automatically close an issue if no one has commented on it for 30 days, which means a lot of unaddressed things get swept under the rug.


> I've been told "That's not an issue, just do your plan/apply manually in two steps" which is pretty contrary to what you should need to do with Terraform.

Is it? I've been contemplating using Terraform for my org, but I would really like a CI system that generates a plan, lets me review it, and then I can approve it and the plan gets applied. Ideally I think this is how I would like all of my infrastructure to work. (It's how I do changes to Kubernetes stuff; generate some resource manifests then apply the files.)
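
For what it's worth, the plain CLI supports that review-then-apply flow via saved plan files:

    terraform plan -out=tfplan    # CI produces the plan artifact
    terraform show tfplan         # review it (or `terraform show -json tfplan` for tooling)
    terraform apply tfplan        # apply exactly the reviewed plan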


With the Fastly provider, you cannot create an ACL and define a lifecycle on it at the same time; trying will make the plan blow up. So you have to split it into one commit, plan/apply it, and then do another. If you tear down the entire system with a terraform destroy and try it from the ground up, it will fail. This makes testing with something like Terratest nearly impossible, as it requires multiple plan/applies in a row to get the environment up, and it's hard to determine which failures are retryable and which aren't.

This sounds far from optimal to me.


This is supported by Terraform Enterprise deployment policies.


Their 30-day auto-close policy is ridiculous. I've run into at least half a dozen bugs in Terraform providers, dutifully gone to report them, and found them auto-closed by the bot. How are they ever going to be addressed with that attitude? At least acknowledge that you don't care and aren't going to fix them.


Yesterday, during the Hashiconf Q&A for one of the Terraform sessions, somebody asked what the plan was for supporting more providers, and the answer was basically "We want to get better at supporting the community to maintain as many providers as there are APIs."

Which anecdotally anyway, it seems like they're putting some effort into. I started using the community built Auth0 provider in production last year and have seen Hashicorp community engagement people wading into the repo offering pointers and helping get the support / quality level up.


> I wouldn't be surprised if they put their less important employees to work on the providers.

No, that is not the case from what I saw when I worked there. At least one provider was supported by a product team, so there was tension between working on the product and working on the provider, and it was easy to lose context on the latest changes another team may have contributed. Contrast that with, say, the dedicated provider developers for AWS, who aren't pulled in multiple directions.


Why should Hashicorp hand-hold "new Go engineers" through PRs? If you don't know how to swim, stay out of the deep end of the pool. If someone contributes a complete, tested, documented, useful PR, that would benefit the community at large, I suspect it would be treated with the respect it deserves. It's not on them to teach people the basics.


If you take the time to write a PR for a project, chances are you have taken the time to read the contributing guide and have tried to submit a passing pull request. There are conventions that apply to a massive code base, such as some of the big providers, that are not initially clear, so I think spending some time explaining why someone hasn't had a PR merged is worthwhile. Sometimes a PR is just ignored completely.

I guess it depends on whether you want people to try to contribute to your project or whether you insta-close the PR because they missed a semicolon, which it seems would be your approach.


Well, I would say in this case it's because Terraform is fundamentally an "integration ecosystem" platform. Its success depends on deep support for cloud resources across multiple cloud providers.

Also, this isn't a space they own. It is chock-full of established players (Puppet, Salt, Ansible, Chef).

Hashicorp thus has two choices: rapidly fix/implement requested improvements, or handhold to get as many participants in the ecosystem as possible.

Otherwise, TF will be a flash in the pan, superseded, or restricted to niches. Where I work it is already being relegated to only security groups and network resources, despite armies of stateless HTTP API servers in our microservice smorgasbord.

The real question is, why aren't the multi-hundred-billion-dollar cloud providers ponying up resources for these platforms, which really are the only way to force multiply and scale up on them?


Yeah, AWS would be well-served to own the provider themselves. Sure, they would prefer people use CloudFormation/CDK, but they still get vendor lock-in if they support Terraform. In fact, they'd probably get even more vendor lock-in. As it stands, the Terraform provider is full of annoying bugs. It's still very usable, but if you dive deep, especially with newer AWS services, you're in for some headaches.


Isn't this article essentially what Terragrunt accomplishes?


Personally, I am not a fan of giving everything its own resources in modularized packages. It makes operations difficult, as you end up with multiple pages of resources. Keep it flat and share things like buckets when the security and retention policies are the same. I think this is fine up to team scale, and I am not sure you want to go bigger. I am not sure module reuse is that useful for Terraform.

I have converged on copying and pasting from some reference examples, e.g. https://github.com/futurice/terraform-examples (note: I contribute to that).


I can't say enough good things about terragrunt.


We have dozens of completely independent applications made up of hundreds of microservices, the result of mergers and acquisitions, different market segments, etc.

Going from wildly disparate stacks to a unified core, while improving reliability and applying _real_ DevOps/GitOps/SRE principles (which are very different from old-school system operations practices), we've independently adopted a model with elements of this.

> This article describes a systematic way of applying terraform at scale. At scale refers to:
>
> 1. high technical complexity: deploy infrastructure to any number of accounts and cloud providers
>
> 2. high organizational complexity: enable multiple teams of developers to work collaboratively

It's making sense of the complexity that requires strong adherence to _reusable patterns_. We also use Terraform for "Monitoring-as-Code", so we use it in non-cloud provider use-cases as well.
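
As one hypothetical illustration of the Monitoring-as-Code piece (the comment doesn't say which monitoring tool they use; this assumes the Datadog provider):

    resource "datadog_monitor" "api_errors" {
      name    = "API error rate"
      type    = "query alert"
      query   = "sum(last_5m):sum:trace.http.request.errors{service:api}.as_count() > 50"
      message = "Elevated API error rate, please investigate."
    }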

Avoiding monolithic Terraform does come with an additional layer of complexity -- absolutely -- but we also don't have cookie-cutter services. We modularize the reusable bits, adopt _dependency injection_ (as opposed to tightly-coupled modules), and have adopted a more _composable_ approach rather than building large monolithic modules that do all the things. There are similar concepts in more mature software engineering organizations (e.g., SOLID principles). Updating a Lambda should not require executing the same monolithic Terraform that manages your RDS instance. So we break apart the execution of Terraform into chunks that each solve a single, orthogonal problem.

Every module is well-tested using Terratest (Golang), and we leverage Terragrunt to solve issues with state management, reducing interdependency, and being able to use the output of one unit of Terraform as input to another unit of Terraform _in an entirely separate run_. This helps de-risk our infrastructure changes, and allows infrastructure to be deployed _entirely separately_ from the application containers, while still leveraging things like ECS Capacity Providers in AWS (which use container traffic to dynamically scale the underlying EC2 instances).
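
A minimal sketch of that wiring in Terragrunt (unit, module, and output names are hypothetical):

    # terragrunt.hcl for an "app" unit that consumes the "network" unit's outputs
    dependency "network" {
      config_path = "../network"
    }

    terraform {
      source = "../../modules//app"
    }

    inputs = {
      vpc_id     = dependency.network.outputs.vpc_id
      subnet_ids = dependency.network.outputs.private_subnet_ids
    }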

But for this to be successful, we've had to democratize the tools. We will never have enough headcount to manage all of the day-to-day ops work for all services. So we've adopted what Netflix calls "full cycle development", where dev teams have taken over day-to-day operational needs (deployments, AMI updates), which has freed up our SRE team to focus on tackling the harder R&D reliability projects and providing solutions as software. It also moves us from a "man-to-man" model to a "zone-defense" model (to use a sports metaphor), which allows us to scale our team resources better.

Development happens in the open (internally) with public (internal) repos, and we encourage discussion, issues, and healthy disagreements to flourish (as long as those disagreements can come to a resolution -- we don't leave things hanging for too long). It becomes an interactive experience between dev teams and SREs to _work together_ to make sure that what we're building is solving real problems that engineers face.

We're not perfect, and there is always more to do, but this has been an absolute godsend to helping us improve how we do things at our scale and with our resources. By building best practices into the base AMIs, shared Terraform modules, Monitoring-as-Code, and other tools, it helps the overall engineering culture move in a healthier direction. We've been able to _dramatically_ cut our AWS costs as a result, leading to more jobs — even during the pandemic!

For anyone reading this blog post and kinda freaking out, I would recommend setting aside your biases and thinking through how a different kind of model may actually be able to _help_ you. Because it worked for us.


Oh, and `aws-vault` (from 99designs).



