Terraform 0.15 General Availability (hashicorp.com)
245 points by myroon5 on April 14, 2021 | 234 comments



So much hate ITT for such a great tool. Terraform is definitely not perfect but it's still one of my favorite tools, solely because of the amount of efficiency gained from learning it. Yes, HCL is not perfect, but it is definitely adequate for a lot of applications. IMO, HashiCorp makes some of the most well thought out tools and I am grateful for their attitude towards open source.


ITT = In This Thread. (I had to look it up.)


I've preferred CloudFormation for many years. The amount of time I spend debugging problems related to TF is infuriating, at least with AWS.


The amount of tooling that has sprung out around managing large CloudFormation deployments is exactly the reason TF exists. CloudFormation is an untenable mess.


No, it’s just that it’s not fun to learn.


Terraform makes me happy, I use it daily on AWS and GCP. I can't stand CloudFormation.


On a related note, CDK for Terraform allows DevOps practitioners to use a variety of programming languages instead of HCL. I've really enjoyed modeling my AWS environments with Python using Terraform only as the engine.

More info here:

https://github.com/hashicorp/terraform-cdk
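For anyone who hasn't tried it, a minimal cdktf stack in Python looks roughly like the sketch below. This is only a sketch: the provider classes come from bindings generated by `cdktf get` (or prebuilt provider packages), so the `imports.aws` path and resource names here are illustrative.

  from constructs import Construct
  from cdktf import App, TerraformStack
  # Illustrative import path; the real one depends on the bindings `cdktf get`
  # generates (or on a prebuilt provider package).
  from imports.aws import AwsProvider, S3Bucket

  class StaticSiteStack(TerraformStack):
      def __init__(self, scope: Construct, id: str):
          super().__init__(scope, id)
          AwsProvider(self, "aws", region="us-east-1")
          # Constructs become resources in the synthesized Terraform JSON.
          S3Bucket(self, "assets", bucket_prefix="static-site-assets-")

  app = App()
  StaticSiteStack(app, "static-site")
  app.synth()  # writes Terraform-compatible JSON for `cdktf deploy` to apply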


If you like CDK, then I highly recommend Pulumi.


Same here. Pulumi got some things spectacularly right. Their Crosswalk for AWS [0] is magical, especially if you're building serverless apps using Lambda or Docker-based services on top of EKS/ECS. I'd suggest sticking to the TypeScript runtime and major clouds where the support is the best.

_Full disclosure_ - Marcin at Spacelift here, and we chose to support Pulumi in addition to Terraform, so this opinion is necessarily biased.

[0] https://www.pulumi.com/docs/guides/crosswalk/aws/


Can second that. Pulumi with Typescript is just absolutely awesome.


I don't get why people like scripting it; declarative is fantastically simple. I don't want to trace through loops, if statements, and functions. It goes from "these files declare my infra" to "these scripts, when run through a parser, define my infra".

I get that logic in YAML/JSON is less ergonomic, but templating makes up for it.


I suspect you haven't tried Pulumi. It is declarative. Sometimes I just want to generate the declarations rather than type them out by hand.


I was more replying to those saying these products now offer a typescript feature


It's probably good when you are already using TypeScript in other parts of your day job. While I like the idea of using a real programming language in place of a declarative DSL, you trade the awkward parts of the latter for the idiosyncrasies of async programming and promises.


I specifically chose TypeScript for my IaC because of its outstanding static typing among scripting languages.


Pulumi needs to work on their documentation. I looked at it about 18 months ago and the Python support was barely there.

Tried it again last month and it's better, but I had to do a lot of trial and error (e.g. try to figure out how to provision an EC2 instance with a root volume; they copied part of the docs from Terraform, but sections are missing).


I tried Pulumi with Python and kept running into show stopper bugs.


I'm currently about to switch to Pulumi/Python (we're a Python shop).

What showstoppers did you run into? Any insight welcome if that prevents me from hitting a wall I've not seen yet :)


My first project was trying to deploy a Redshift cluster. It flat out didn’t work. Did 2-3 calls with their tech sales before giving up on it. And then cdktf came out shortly after that.


How do they differ?


So regular CDK is basically a program-driven CFN generator.

Pulumi has a similar model where you build a resource graph at runtime BUT it's also got the execution engine built-in to the tool.

What this means in practice is that you can create resources (like a kube cluster) and then use them as providers (e.g. provision state-tracked resources with the kube API) all in the same operation.

You can also (in your infracode or an importable module) define "dynamic providers", meaning you can easily extend the execution engine to do custom things related to your workload.

As an example, imagine you want to create a cluster, deploy an app, then provision (state tracked) some app resources like an admin user and group via the app's REST API. You can do that without too much fuss.

Neither terraform nor CDK can really do those things very well. TF is not powerful enough language-wise, and in CDK the execution phase is locked away from you.
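A rough Python sketch of the cluster-then-provider pattern described above, assuming the `pulumi_eks` and `pulumi_kubernetes` packages (resource names are illustrative):

  import pulumi
  import pulumi_eks as eks
  import pulumi_kubernetes as k8s

  # Create the cluster...
  cluster = eks.Cluster("app-cluster")

  # ...then immediately use it as a provider for further state-tracked resources.
  k8s_provider = k8s.Provider("app-k8s", kubeconfig=cluster.kubeconfig)

  namespace = k8s.core.v1.Namespace(
      "apps",
      opts=pulumi.ResourceOptions(provider=k8s_provider),
  )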


I don't think I 100% understand.

Could you elaborate a bit more?

Let's say I have an app with API Gateway, Lambda, and DynamoDB.

I would provision them with one of those tools CDK or Pulumi.

How would these tools differ in their provisioning steps?


You wouldn't see much _practical_ difference between CDK and Pulumi (or Terraform!) for that use-case. The workflow would feel almost identical.

Under the hood different things are happening.

Your CDK program would run, constructing all the resources you declared. That would generate CloudFormation template(s) that get submitted to the CloudFormation service for evaluation. The CloudFormation service (running inside AWS) is the "execution engine" and is responsible for creating and tracking the state of your resources.

Pulumi would run your code, build an object graph, then do the "execution" itself - invoke all the required AWS API calls to create your resources (the API Gateway, the Lambda, etc etc) from where the CLI is running. The CLI would also be writing out all the resource state to somewhere.

The tradeoffs are in line with what you might expect. The Pulumi approach is more powerful, but you "own" more of the complexity since it lives on your side of the responsibility boundary.

Some people prefer AWS to be the execution engine; they feel it's more reliable to let AWS provision resources and keep track of state. They like it that AWS is responsible and will fix bugs.

Others prefer the increased control of "owning" the execution engine. This means being able to debug it or extend it with third party or custom providers that let you provision new resource types. They're happy that they don't need to wait for AWS to fix things, they can do it themselves if they have to.

This is not the only difference between the two tools but it is one of the most fundamental ones.


Ah, thanks for this explanation!


Nice.

Does Pulumi support stuff like Cloudflare Workers, FaunaDB, or Auth0?


Workers yes, Auth0 yes, FaunaDB not built-in but you can add it (there is an experimental provider someone made).


>"So regular CDK is basically a program-driven CFN generator."

What is CFN here? Cloudformation?


Yes.


+1 to Pulumi.

xyzzy123 already described the differences between Pulumi and terraform, but I want to add one key way in which they are similar:

Pulumi uses terraform under the hood. We get all of the reliability of terraform, but with a much more powerful runtime engine.


I guess it depends on what you mean by "under the hood". As far as I know it doesn't use Terraform during runtime but it uses the Terraform resources for generating language definitions. It has a lot of interoperability tools as well such as a "terraform bridge" and a tool that converts Terraform projects.


Thank you for clarifying -- I didn't know that!


The 'not using HCL' bit is really the only positive I've found so far because everything else has been more difficult than just using TF directly. I think my goals were slightly off from the beginning, because this is really just replacing one CLI with another for me at this point.

What I want: Use Terraform programmatically, i.e. call "cdktf deploy" or similar FROM node or python and give users some scripts they can use where I can abstract away some of the difficulties of learning to use Terraform natively for simple use cases (i.e. deploy an S3-based frontend host). Ideally, I had intended to distribute some npm-installable packages which would run this stuff.


> What I want: Use Terraform programmatically, i.e. call "cdktf deploy" or similar FROM node or python and give users some scripts they can use where I can abstract away some of the difficulties of learning to use Terraform natively for simple use cases (i.e. deploy an S3-based frontend host). Ideally, I had intended to distribute some npm-installable packages which would run this stuff.

This is actually exactly the use case we’ve designed the Pulumi Automation API (https://www.pulumi.com/blog/automation-api/) to support.

Allowing modern IaC technology (like Pulumi or Terraform) to be easily embedded into custom software solutions, instead of just being something humans work with directly, is a huge potential enabler for the next wave of cloud infrastructure management tooling.
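As a hedged illustration of what that embedding looks like with the Automation API's Python SDK (the inline program and names below are made up for the example):

  from pulumi import automation as auto
  import pulumi_aws as aws

  def frontend_host():
      # An ordinary inline Pulumi program, defined inside the embedding application.
      aws.s3.Bucket("frontend",
                    website=aws.s3.BucketWebsiteArgs(index_document="index.html"))

  stack = auto.create_or_select_stack(
      stack_name="dev",
      project_name="frontend-host",
      program=frontend_host,
  )
  stack.up(on_output=print)  # provision or update the stack from inside your own tool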


> What I want: Use Terraform programmatically, i.e. call "cdktf deploy" or similar FROM node or python and give users some scripts they can use where I can abstract away some of the difficulties of learning to use Terraform natively for simple use cases (i.e. deploy an S3-based frontend host).

Maybe not node/python, but I'm pretty sure you can use terraform as a package in Go. If not, there is always the "make a temp dir, write/download the necessary tf files, run terraform apply" approach.
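For what it's worth, that shell-out approach can be as small as the following Python sketch (the helper name is made up; state handling is left to whatever backend the rendered config declares):

  import subprocess
  import tempfile
  from pathlib import Path

  def apply_config(tf_source: str) -> None:
      # Render the .tf files into a temp dir, then drive the stock terraform CLI.
      with tempfile.TemporaryDirectory() as workdir:
          Path(workdir, "main.tf").write_text(tf_source)
          subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
          subprocess.run(["terraform", "apply", "-auto-approve", "-input=false"],
                         cwd=workdir, check=True)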


That's good to know at least; will give the go API a look. The latter option you're recommending is essentially what I went with (Node bin script that shells out to run cdktf commands).


There is little value in abstracting HCL à la Troposphere. Just write it natively.


Thanks for the share, I hadn't heard about it before. I see the advantage in a use case that we have:

We are actually working on "how to manage Terraform resources". We ended up having a dev <-> ops conflict where dev teams are messaging the Terraform guys to create resources. For example, to create a database, it takes only 1 hour of writing HCL, but 2 weeks of emails to align on the specs.

We are currently building something on top, to have resources that can be created in a self-service mode by the devs themselves. (Behind the scenes, it uses Terraform modules to generate resources that will comply with the company policies).

For the moment, it's a bunch of Jenkins pipelines. Having a CDK can actually help us a lot there. (We can plug it into a CMDB database, have a UI on top, etc.)


Are all variables, conditionals, templating, and loops done in Python? Or is some of that still needed on the Terraform side?


Basically you create the desired state DAG in procedural code rather than the TF DSL. The diffing and applying are the same.


Can you inspect inputs from terraform resource attributes or data sources in the procedural evaluation?


> Can you inspect inputs from terraform resource attributes or data sources in the procedural evaluation?

No... the high-level programming language really just serves as a bridge or translation layer to a Terraform-compatible JSON file. Those sorts of evaluations don't happen until the actual plan/apply. However, you may find it useful to make direct API calls to your cloud provider in cdktf stacks. For instance, I mostly use data lookups, but if I want to perform string operations on that sort of data I would use boto3 instead.
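A small illustration of that pattern, with hypothetical names: resolve a value with boto3 at synth time, massage it in plain Python, then pass it into the cdktf stack as an ordinary string.

  import boto3

  ssm = boto3.client("ssm", region_name="us-east-1")
  raw = ssm.get_parameter(Name="/platform/vpc-id")["Parameter"]["Value"]
  vpc_id = raw.strip().lower()  # the kind of string manipulation HCL makes awkward
  # ...later, pass vpc_id into resources inside a TerraformStack as a plain value.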


I really wish CDK supported writing a Terraform plugin in any of those languages.


I really wish they would support CDK in Go!


Getting started with the AWS Cloud Development Kit and Go: https://aws.amazon.com/blogs/developer/getting-started-with-...


We were talking about Terraform CDK here, not AWS CDK. It is cool that AWS CDK supports Go though!


>"On a related note, CDK for Terraform allows DevOps practitioners to use a variety of programming languages instead of HCL."

Out of curiosity what is it that people generally don't like about HCL?


Bizarre syntax, crazy resource paths, whole directories just to define one function, etc etc etc.


Yet one more thing to learn. No types (hints in IDEs).


v2 has types, and my IDE (VS Code) does hints.


The IntelliJ HCL plugin has full refactoring support, auto completion etc. It’s a shame it’s not open source, since there must be a Kotlin implementation of an HCL2 parser underpinning it.


Thanks for this, I was unaware!


Why would you use CDK with Terraform instead of CDK with CloudFormation? The latter seems like a more reasonable choice as it is native to AWS.


CloudFormation is really slow and lags behind in supporting all AWS features.

Terraform is super fast, has great docs and vast coverage for managing just about any type of cloud resource.


Why "practitioner"? it really does not feel like the correct term.



But that still does not align with the use of the word practitioner, it's usually a formal profession with licensing such as medicine or law.


That's just one way to use it. Practitioners practice something. For example you can find https://en.wikipedia.org/wiki/Martial_arts using "practitioner" for people training martial arts. There are practitioners of religion, techniques, etc.


Hijacking a bit, but does anyone have any good resources/guides around managing terraform state in larger organizations? Terraform enterprise seems to address this but I was wondering if there's workflows that allowed subsections of infrastructure (think teams or systems) and didn't rely on a re-evaluation of the entire organization's assets. So far the only approach I've seen is having protected high level (VPC, subnets etc) as a separate state and using terraform_remote_state to reference those.


There was a good talk at Hashiconf a few years back, can't find a link now.

I can't say that this is the "right" way to do it, but it scales OK for our org (100+ engineers, 200+ services per environment, many third-party and in-house providers). A single team owns the codebase, though all backend engineers are expected to write and maintain their own infra code. Some highlights:

  * Monorepo with ~ 700 terraform modules, > 1000 terraform workspaces
  * CI/CD tooling to work out the graph of workspaces that need to be executed for a given change using merkle tree/fingerprinting, and build dynamic plan/apply pipelines for a given PR
  * Strict requirements about master being up to date, and serialised merges managed by a bot, with continuous deployment (i.e apply) on merge
  * Templated code/PR generation for common tasks
  * Tooling for state moves, lock management etc
  * Daily full application of all workspaces with alerting on state drift (e.g externally updated resources)
As an org, we average about 20 infrastructure changes per day through this system.

A few tips:

  * Find the right level of abstraction for breaking larger workspaces down into smaller ones. This should be determined by things like rate of change, security requirements, team ownership, and in some cases whether you have a flaky provider that you want to isolate. Size of state should also be a consideration - if a workspace takes 2 minutes to plan, it's too big
  * If you start to use lots of remote states, wrap all remote states into a module with a sane interface to make consumption easier. You can also embed rules in this like "workspace X cannot consume from state Y" (e.g because of circular dependencies or security considerations)
  * Never embed a provider within a module (I think this is enforced in newer TF versions)
  * Terraform is a hammer that can tackle most nails, but for several problems there are more appropriate tools


If you don't have a CI/CD system that already allows you to deploy 100 changes a day, you can't do large scale monorepos, or you'll get caught up in continuous integration hell.

In that case, the best choice is lots of remote states / data sources, independent modules in independent repos that reuse other modules, and strict adherence to internal conventions, including branching/naming/versioning standards running the gamut from your VCS, to the module, to the code, to the data structures, to the "terraformcontrol" repos, etc. Basically, standardize every single possible thing. If anyone ever needs something to work differently, update the standard before the individual module.

How and when to separate remote states is still a bit of black magic. In general, you can make a new state for each complete unit of deployment. Assuming a deployment has stages, you can separate terraform state into those different stages, so that you can step through them applying as you go and stopping if you detect a problem. The biggest mishap is when you're trying to apply 100 changes and your apply fails half way, and you have to stop the world to manually fix it, or revert, which may not even work. It's much easier to manage a change that affects a few resources than lots of them.


I really wish there were more in depth articles and tutorials around this topic as it was a pretty big pain point when I started out. It’s been a little over a year now since we started using terraform for our aws infrastructure and here is how we set it up:

- /modules holds a bunch of common modules. As an example, we have an aws_application module to setup application configuration in system manager, and an ecr repository for the docker image.

And then we have our workspace folders (these do not actually use Terraform's concept of workspaces). They go down in specificity and use terraform_remote_state for values from the previous workspaces. Our CI runs terraform automatically in this order:

- /workspaces/aws sets up our main aws account. No applications actually run here, it’s just iam setup and some random configuration values.

- /workspaces/production sets up a lot of the backend. We aren't a big company so we don't have to deal with cross-region databases or redis clusters. This is also where we would call the aws_application module and set up an ECR repo.

- /workspaces/production-us-east-2 is where we setup the ecs cluster, task definitions, load balancers, and dns routes. We’re small, so we only have one region, but the idea is we could copy this folder to another region and horizontally scale super easy.

Then we have the same folders for our staging setup, only with some scale values and configuration tweaked.

Overall I’m pretty happy with this solution. It keeps any issues from spreading far due to the different state files. Though this can get pretty ugly if you need to use some region information in the environment workspace (production-us-east-2 values in production). I also can’t comment on how well it scales out past what we have done so far.


The way we do it is 1) each team owns one or more of AWS sub accounts (e.g a particular app or function will be in its own account) 2) An internal version of this is used to establish and enforce company-wide standards: https://github.com/cloud-custodian/cloud-custodian 3) A repository of terraform modules is shared amongst teams to standardize on how common AWS resources are used (e.g. enforce X, Y and Z for S3 buckets)

This way, the per account setup (represented as a repo) is relatively small, common patterns are standardized, and there is still room for experimentation.


No, that's pretty much it.

Some core components that most teams will include as remote state, and you can discover and use stuff from other teams as remote state, too.

Quite basic, but something tells me this could very well be intentional, so that the enterprise offering has a clearly increased value.


We decided to go for Terraform Cloud in our company, it works really great.

You can set up a workspace per subfolder, enable automatic planning (or not) on each new commit, and put a team on it.

Main features I see: it stores the state and takes care of durability and encryption, plus you can create teams, get a CI/CD pipeline, and manage variables.

There is also a free tier for up to 5 active users, good for experimenting with it / showcasing the advantages to stakeholders.


We use Spacelift at our company, pretty new tool, but I prefer it to Terraform Enterprise and the migration was way easier.


When it was up to me, I managed things with small projects, rigorous standardization of naming and tagging, and a shared “metadata” module that could look up all the details you needed based on region, account (from the provider) and VPC name. Takes some discipline but makes for much more efficient Terraform, IMO.


I think in larger companies you separate infrastructure from teams. Teams are provided with tools or ways to set up new resources at a higher level.


That’s pretty much the only way I’ve found using the existing open tools.


Perhaps off-topic but how have people upgraded TF codebases to new versions? Just last year we had a big effort to upgrade a huge code-base from 0.11 to 0.12. I feel like it should be a lot smoother than a full-team full-sprint effort.


I'm one of the HashiCorp founders.

Terraform 0.11 to 0.12 is by far the most difficult of the versions to upgrade between. I am really sorry about that. The other upgrades should be relatively minor as long as you read and follow the upgrade guides and upgrade one minor version at a time (0.11 => .12 => .13 etc.). There are rough edges for very specific cases but most of our customers were able to upgrade from 0.12 to subsequent versions same day without issue.

Breaking changes and difficult upgrades are not something we want to do with Terraform (0.12 being a big exception, as that was a very core "reset" so to speak). The reason there have been these changes in recent releases is that we've been ensuring Terraform is in a place for 1.0 where we don't have to have difficult upgrades.

You can see this path happening in each release:

- Terraform 0.15: state file format stability

- Terraform 0.14: provider dependency lock file

- Terraform 0.13: required provider block (for more deterministic providers)

- Terraform 0.12: stable JSON formats for plans/config parsing/etc. (particularly useful for interop with things like Terraform Cloud and 3rd party tooling)

This was all done to lead up to a stable 1.0 release.

As noted in the blog post, 0.15 is effectively a 1.0 pre-release so if everything goes well, we've MADE IT. For 1.0, we'll be outlining a number of very detailed compatibility promises which will make upgrades much easier going forward. :)


Our teams have something like 100,000 LOC in Terraform 0.12, and it's not all in one big monorepo. At that scale there is no such thing as a relatively minor version upgrade.

We want to upgrade to get away from some persistent 0.12 bugs, but we literally don't have the time. We have to change all of the code, and then test every single project that uses that code in non-prod, and pray that the testing finds most of the problems that will appear in production. And it's all owned by different groups and used in different projects, so that makes things longer/more complex. We also have to deal with provider version changes, upgrading CI pipelines and environments to be able to switch between Terraform binaries, and conventions to switch between code branches.

I am already looking around for some way to remove Terraform from our org because it is slowly strangling our productivity. It's way too slow, there's too many footguns, it doesn't reliably predict changes, it breaks on apply like half of the time, and it's an arduous manual process to fix and clean up its broken state when it does eventually break. Not to mention just writing and testing the stuff takes forever, and the very obvious missing features like auto-generation and auto-import. I keep a channel just to rant about it. After Jenkins, Ansible and Puppet, it's one of those tools I dread but can't get away from.


You can use tfenv to upgrade individual workspaces one at a time. You don't need to do a big bang upgrade.

Note upgrading to 0.13 is quite easy, and terraform actually has a subcommand that does most of the work for you (usually no additional steps required).

> I am already looking around for some way to remove Terraform from our org because it is slowly strangling our productivity.

The only other real alternative you have is Pulumi. All other alternatives are, in my opinion, way worse. You can use Ansible, which is even worse because you have to manage Ansible version upgrades and have no way of figuring out what changes will be made (yes, --diff is usually useless). You can manage things manually, but good luck. Lastly, your option is CFN (or the Azure/GCP equivalent), but then you have no way of managing anything outside of the cloud environment.


There is no solution where 100k loc is not going to be challenging to keep over time.


While it's not possible to make an apple-to-apple comparison (Terraform-to-?), if we compare to something based on an imperative language, say Puppet or Chef, there is a huge difference.

In my opinion, Terraform's big issue is that it was born as a declarative tool for managing infrastructure. Large configurations (IMO) necessarily ossify, because you don't have an imperative language that makes small progressive changes tolerable - it's a giant interrelated lump.

What's worse, when it grows, one needs to split it into different configurations, and one loses referential safety (resources will need to be linked dynamically).

A Chef project of equivalent size, say, can be changed with more ease, even if it's in a way even less safe, because you have the flexibility of a programming language (of course, configuration management frameworks like that have a different set of problems).

I'm really puzzled by the design choice of a declarative language. Having experience with configuration management, it's obvious to me that a declarative language is insufficient and destined to implode (and make projects implode). Look at the iterative constructs, for example, or the fact that some entities like modules have taken a long time to be more first class citizens (for example, we're stuck with old-style modules that are hard to migrate).


> compare to something based on an imperative language, say Puppet or Chef

I'm puzzled by this comparison. I consider both of these to be primarily declarative languages. You declare the state you want puppet or chef to enforce, not how they get there.

E.G. https://puppet.com/blog/puppets-declarative-language-modelin...


I've indeed stretched the concept by equating Chef and Puppet (I guess the latter is closer to TF).

To be more accurate, I'd say that Chef has a declarative structure supported by the imperative constructs of the underlying language, and this is what makes for me a big difference.

Consider the for loop as an example. By the time it was added (v0.12), there was a (200-page) commercial book available. And there are people in this discussion stuck at v0.11.

The difference in the declarative vs. imperative nature, as I see it now that the for loop is implemented in TF, is that it's embedded inside resources, that is, it fits strictly a declarative approach, and has limits. In Chef, you can place a for loop wherever you prefer.

Object instances are also another significant difference. It took a while for TF to be able to move (if I remember correctly) module instances around (that is, to promote them to "more" first-class citizens), which made a big difference. In an imperative language, accessing/moving instances around is a core part of the language. In Chef, pretty much everything is global - both for good and for bad. But certainly the good part is that refactoring is way more flexible.

I think TF has always been plagued by repetition; in my view, this is inherent in the "more" declarative approach (since they're trying to embed imperative constructs in the language).


I have really bad memories of the change between puppet 2 and 3 for example.


Same. I went through a puppet 2->3 migration and also through a terraform 0.11->0.12 update.

The puppet migration was definitely more painful, because of the entangled code.


It's not clear if it was entangled because it was written in the specific framework or because it was just badly written code. In the latter case, this hasn't really got anything to do with the framework. Additionally: did you make the v0.12 migration just work, or did you change the codebase to take advantage of the new features (and remove inherent duplication)?

There are inherent problems in the TF framework and the migrations. 0.12 introduced for loops, and 0.13 added module support to them. So a proper migration should convert duplicated resources into lists of resources. This is painful for big models, since one needs to write scripts in order to convert resource associations in the statefile. And hope not to miss anything!

Due to the strictly declarative nature, it's also difficult to slowly move duplicated resources into lists, and handle both of them at the same time.

At this time, our team is stuck with a certain TF version, and can't move without spending considerable resources.


Yeah, same boat. We ended up doing several complete rewrites and finally giving up. My main grievance is HCL; it's so close but so far from an actual programming language that it drives me mad, even after a few kilolines of it in prod. We ended up going with Pulumi, which so far has served us well.


It seems that terraform CDK has been introduced to compete directly with pulumi.

I think both are a great idea as the DSL has given me so many headaches over the years.


Tangential, but curious: how did you get to 100k lines of TF? I'd imagine most things within your company would follow very similar patterns and therefore be extracted into modules, and the per-app/team code would be relatively small and focused on how to compose these modules together.


Modules are useful only up-to a point. Creating complex modules with a ton of moving parts makes it difficult to make changes, to upgrade etc. The best recipe that I’ve found is to use modules to enhance some core functional component and then compose these modules to build infrastructure, rather than defining your entire stack in a single module.


We also found TF 0.12 to be quite slow. But this was fixed in 0.13 and now it feels lightning fast compared to before.


> and it's not all in one big monorepo

There's your first problem.


Thanks, good to know that the upgrade to 12 is the biggest jump.


I had the same question or concern. I also realized too late that 0.12 is a bigger one than first thought. I was not severely impacted in the end, but boy it was a long time that I didn't experience such a tough upgrade. Happy to know that the hardest is behind and looking forward to try 0.15. Thanks


I upgraded from 11 to 12 about one year ago and from 12 to 13 some days ago (the upgrade to 14 looks like it will be straightforward), but in my case this is what I did:

  - Don't upgrade directly from 12 to 14, go to 13 first
  - If you have warnings after moving from 11, fix them first
  - Run the 0.13upgrade command in your code that will generate the required_providers
  - Run terraform-v13 init
  - Change to correct workspace if using some
  - Run terraform-v13 plan which will probably fail due to the new explicit required-providers rule, if that happens you need to modify the state with the correct providers https://www.terraform.io/upgrade-guides/0-13.html#why-do-i-see-provider-during-init- . In my case I have a lot of modules, so I created a script that automated that process (see the sketch after this list)
  - Execute again terraform-v13 plan and verify that it will not make uncommon changes
  - Then run terraform-v13 apply.
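For reference, a sketch of what such a script can look like, assuming 0.13's `terraform state replace-provider` subcommand and an example list of legacy providers:

  import subprocess

  # Example provider list; adjust to whatever `terraform providers` reports.
  LEGACY_PROVIDERS = ["aws", "random", "template"]

  for name in LEGACY_PROVIDERS:
      # Rewrite legacy provider addresses in the state to the new namespaced form.
      subprocess.run(
          ["terraform", "state", "replace-provider", "-auto-approve",
           f"registry.terraform.io/-/{name}",
           f"registry.terraform.io/hashicorp/{name}"],
          check=True,
      )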


> Don't upgrade directly from 12 to 14, go to 13 first

The line above, plus running apply in each version, is key. I literally just did the update from 11->latest for 3 different repos a couple of weeks ago. And tbh, it was only the first update where I had to make any code changes. The rest mostly worked.


Someone not on our team upgraded the version by mistake from 0.12 to 0.13 (he was contributing something small and used the latest); the CTO got involved and made us update everything, and it was a big undertaking.

Personally I have a nix shell file pinned to the exact version of terraform (as in, commit hash on the nix-packages repo) we use in every repo and just switch to that shell before doing anything.


We had this issue as well where another team was on 0.12 and we were still on 11 and even running a plan I think could ruin the state.

We now have all of the systems tf version pinned:

  terraform {
    required_version = "= 0.12.29"
    ...
  }

As others said, tfenv lets you easily switch between versions, but you don't get a warning if you're accidentally on 0.12 while the repo is currently using something else.



I'm a fan of tfenv for this; it's really easy to use and makes it trivial to pin each stack to an exact version of TF.


Between rbenv, tfenv, pyenv, sdkman and so on and so forth, maybe it's time for some sort of common OS-level env-management interface...?



I use asdf and it's great. You keep your versions in .tool-versions and when you switch branches you automatically get the right versions of node, java, terraform, etc. on your path.


Using this. It's fantastic.


You can whip up something pretty quickly as a shell wrapper and add to it as you need. I just threw this together: https://gist.github.com/peterwwillis/755002d6d3849af5bbc6cb8...

  $ cliv
  
  Usage: /home/vagrant/bin/cliv [OPTS]
         /home/vagrant/bin/cliv [-i] VERSION [CMD [ARGS ..]]
  Opts:
          -l              List versions
          -h              This screen
          -n              Create a new /home/vagrant/.cliv/VERSION
          -i              Clear current environment
  
  $ cliv -n tf012
  $ cliv -n tf013
  $ cliv -l
  tf012
  tf013
  $ cp terraform-v12 ~/.cliv/tf012/bin/terraform
  $ cp terraform-v13 ~/.cliv/tf013/bin/terraform
  $ cliv tf012 terraform
  Terraform v0.12.29
  $ cliv tf013 terraform
  Terraform v0.13.3
  $ cliv -i tf012
  PATH=/home/vagrant/.cliv/tf012/bin:/home/vagrant/bin:/usr/local/sbin:/usr/local/bin:
  /usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  PWD=/home/vagrant


I had to put a newline in your PATH because the unbroken string was borking the page layout. Sorry, it's our bug. Someday someone will show me how to fix it. Some people have come close.


That is a selling point of nix and docker, yes:) The catch is that doing it generically makes the whole thing more complicated (although I suspect nix and docker are more complex than is strictly required for that use case)


Our team just did a big effort to get from 11 to 12. After that the effort was quite minimal. Some little gotchas re: providers in 13, but we just today finished the 14 upgrade and will probably let 15 marinate until we upgrade to that.


I think we're dealing with the providers now, as we get warnings on 0.12.29 that they'll be deprecated in 13, so it's nice to know that we're over the biggest hill.


It becomes much easier between other versions. Though when upgrading 0.12 to 0.13, I remember I had to pull the state and change the provider field manually to avoid recreation of some resources.


> i had to pull the state and change the provider field manually to avoid recreation of some resources.

Terraform CLI introduced an upgrade command (can't remember what it's called) that automatically does this for you.


In 0.14 that is no longer available.


Terraform 0.14 is pretty much fully compatible with 0.13, so no such command is necessary. All you have to do is make sure your state is for version 0.13. Outside that it's a bunch of usability changes that do not affect your tf code.


On the bright side, now that they've committed to a stable state file format it should get better


With lots of swearing, at least in my experience when we did the same upgrade as you. I too am hoping this will be smoother in future and the changes stabilise. I guess if you're doing very simple stuff then it's not an issue but the larger the estate and 'custom' stuff that gets added the more difficult any upgrade becomes (tf or not)


From our experience: 0.11 -> 0.12 we're not attempting. We're in the process of changing out our config management anyway and we're unhappy with a bunch of decisions in that Terraform stack, so it's a good point to get rid of it.

0.12 -> 0.13: Well, we had to add what feels like a million required-provider blocks and then some more. And sometimes it was tricky to pinpoint the module pulling in a default provider and crashing - `terraform providers` and `terraform graph` help there. The graph is easy to grep through to find the resources and modules pulling in wrong providers. And in the beginning, the error message we got when we had to run 'terraform state replace-provider' was... obscure. In newer versions, that message is much better.

0.13 -> 0.14 just happened and now the lock files are slowly piling in on demand.


As no one seems to have mentioned anything, version numbers below 1.0 are generally considered unstable in terms of API interface (like HCL in this case) so if you'd like to avoid similar things in the future (with either Terraform or other tools), you're best to wait until they at least release version 1.0.


I started using Terraform on our project in early 2019, when it was at version 0.11.13. The upgrade to 0.12.x seemed non-trivial so I put it off... now 2 years later we're at 0.15.x.

Looks like I need to clear my schedule in an upcoming sprint to get this done so the pain doesn't get even worse :)


As everyone else said: 0.11 to 0.12 was painful. After that, you just need to deal with converting the state whenever a new version comes out, but that's basically automated so not a biggie.


The terraform 0.1Xupgrade subcommands went pretty smoothly most of the time, with some minor manual changes and fixes here and there. Most of them were easily done in batch over the repository using sed.

This approach worked for me in various setups from a small SaaS company to a major travel company as well as personal projects without issues.


After 0.12 it has been much easier.


My tiny brain still doesn't get why people like Terraform. Do people need to look at both the Terraform docs and the aws/azure/gcp docs when writing a .tf file? The fact that Terraform saves/remembers the resource states is like a double-edged sword: we cannot manually fix some minor mistakes of ours when creating resources because that'll mess up Terraform.


Have you ever set something up manually via AWS console, then 6 months later totally forgot the steps you took and end up wasting a lot of time reverse engineering what you did in order to make a comparatively small change?

After that perhaps you vowed to take better notes, so the next time you do it that way, but then 6 months later you find that you missed some detail, or there was some changes in between that were not recorded.

So then you decide you're going to do everything by API and save the commands, so you write a bunch of bash scripts that execute against the AWS CLI. Over time you add several more for different operations (e.g. adding an instance, configuring a new SSL cert, etc.), but the list of scripts grows very long, and you find that each one makes a lot of implicit assumptions about the state of the infra when it's run, so you end up with varying degrees of confidence in the scripts depending on how often you run them.

Now you are primed for Terraform. At this point you realize the hard part about cloud configuration is state management. Furthermore, you realize there are some common patterns of how different components and APIs interact with regard to serial dependencies and operation idempotency, however the specifics vary by service and by use case. Terraform gives you a standard substrate for state management, and a framework for developing service "providers" that know how to interact with APIs and map them to state. All of this happens in code which is declarative, can be code reviewed, and state which can be centrally tracked and shared among a large team for a clear audit trail.

There's definitely a learning curve, but once you learn it the overhead is pretty small compared to the benefits, even for small teams IMHO.


I'm not parent poster, but I believe (when he mentioned looking at TF and AWS documentation) he most likely meant to use TF instead of CloudFormation. And frankly I think CF is far more robust.


The jump from what you mentioned in paragraph 2 to 3 is not necessarily Terraform. You can use other tools, like Ansible, that IMO have a much better framework than Terraform. I use Terraform for extremely simple stuff that is easy to destroy/recreate. Projects of bigger scale IMO are better served with Ansible and friends.


Ansible looks fine when you start, but gets painful.

You need to write idempotent ansible from the start. No excuses.

Then you would implement a retry logic in ansible since your cloud APIs will fail for weird reasons or you really must wait that a resource in AWS is truly ready before you run another task. ( for example vpn gateway in AWS must be ready and then you can attach a vpn connection, or a route53 zone record has to propagate before you do something with it. But in the meantime your other tasks can continue in parallel. Only the vpn tasks needs to wait)

Then after a while with all your resources created with your idempotency implemented in ansible, your code will roughly first check ( with timeouts, retries) if you actually can skip tasks since there is nothing to do. The checks increase, your playbook takes longer and longer to execute. Maybe you try to parallelize it in some way. In the meantime a colleague of yours executes the same playbook in their machine or pipelines, and now you see weird side effects. Maybe you start implementing some locking behavior.

At this point we did not do any code reviews yet and we did not answer the question: “how should the infrastructure look like?”

This especially means: how do you actually delete resources with Ansible? So you Start to introduce a kind of “state” for your resource.

So, if you squint hard enough, idempotent, resilient, parallelizable, thread-safe, fast ansible with a state is what terraform solves. If you look at tools like terratest you even get to do unit/integration testing for your infrastructure code.

So as soon as you are more than 1 person handling infrastructure following best practices like code review and testing, without getting insane, use terraform.

If you are alone and don’t want to follow the above best practices, you still have to be very good in writing idempotent, resilient ansible code.

My hypothesis is that the intersection of people who write high-quality (idempotent, resilient) ansible code and people who do not care about code review/testing is quite small.

Once you do not work in isolation any more, either in a team or as a consultant who needs to hand over their work (and Teach people how idempotent ansible works), I would favor terraform any time.

You could even call ansible from terraform if you need some ansible integration, where ansible has a nicer api for you. But let terraform handle the retries, state management and ansible could just be the executing part. So ansible is more like a fancy bash script/function which you call in an orderly manner.


> You need to write idempotent ansible from the start. No excuses.

Absolutely. And code reviews, etc. just like code.

> Then you would implement a retry logic in ansible since your cloud APIs will fail for weird reasons or you really must wait that a resource in AWS is truly ready before you run another task. ( for example vpn gateway in AWS must be ready and then you can attach a vpn connection, or a route53 zone record has to propagate before you do something with it. But in the meantime your other tasks can continue in parallel. Only the vpn tasks needs to wait)

Yes, Ansible modules have a retry built in or a wait_for statement. You can also control the serialization serialize: (or was it parallel), or run_once or when. You have a couple of tools at your disposal.

> At this point we did not do any code reviews yet and we did not answer the question: “how should the infrastructure look like?”

Why not? We usually start with some diagram in README.md or Confluence or a Design Proposal before the code. Code never gets released without a merge request.

> This especially means: how do you actually delete resources with Ansible? So you Start to introduce a kind of “state” for your resource.

state: absent, you can use in about every module.

> So, if you squint hard enough, idempotent, resilient, parallelizable, thread-safe, fast ansible with a state is what terraform solves. If you look at tools like terratest you even get to do unit/integration testing for your infrastructure code.

I believe terraform hides way too much and is too magic. Troubleshooting when terraform goes wrong is really really hard. You have two states to look on -- the cloud state and the TF state.

And now you need to know two tools too, since you will likely need Ansible anyway to configure and operate the hosts later.

> So as soon as you are more than 1 person handling infrastructure following best practices like code review and testing, without getting insane, use terraform.

Beg to differ, 100s of engineers sharing ansible galaxy roles with semantic versioning and whatnot in a very good way at my org.

> Once you do not work in isolation any more, either in a team or as a consultant who needs to hand over their work (and Teach people how idempotent ansible works), I would favor terraform any time.

Maybe you're right. If I'm doing freelance I will likely use Terraform to interact with Cloud. IMO Ansible, just like coding, requires a bit of a community around it with best practices, reviews and etc.


> Beg to differ, 100s of engineers sharing ansible galaxy roles with semantic versioning and whatnot in a very good way at my org

I understand we both have our experiences when using ansible compared to terraform. Good to know that this scales

Regarding your point:

> state: absent, you can use in about every module.

What I meant is that you probably would create 2 pull requests. The first one to actually delete things from your infrastructure and then a second pull request to delete your ansible code. I am still not sure how changing things works in this manner, where you destroy and recreate things in a single logical pull request/commit for example. I guess tracking multiple pull requests in a single ticket might work.

> And now you need to know two tools too, since you will likely need Ansible anyways to configure and operator the hosts later.

I agree. Well, I would use ansible beforehand with Packer to build immutable images/AMIs which are launched by an auto scaling group and a final configuration via cloudinit. Therefore I would not have a chance to apply ansible during autoscaling.

Thanks for your input and valuable discussion. Stay safe


Thank you for this. I've been hemming and hawing over straight Ansible or Ansible + Terraform and you made it clear why the latter makes sense.


I'm no tooling expert but that seems like the opposite of what you should use Terraform for. Imperative frameworks like Ansible are basically fancy ways to organize and execute custom scripts, you still have to implement your own custom state management logic in Ansible commands/scripts. With larger & more complex infrastructure, you want to do this as little as possible because it's hard to consistently get it right, so Terraform basically steps in and implements state management for you. Even if you end up needing to write your own TF drivers, regular devs don't have to deal with it to make infra changes. Declarative templates are much simpler and easier to think about, all the complex state management stuff is hidden away.


Ansible has a bunch of declarative modules. Writing imperative Ansible should basically never happen. My team maintains 50-80 galaxy_roles and it's super rare that we have to build any shared galaxy role on top of imperative logic.


True, and interesting to see that it has worked out so well for your team. I suspect the difference I see in practice is that Ansible playbooks make it much easier to hack in imperative shell scripts in an Ansible task, so it's more likely to happen. I agree that a disciplined & experienced team can do the right thing with either tool and perhaps Ansible's API is nicer if you use it right. This makes me see Ansible in a slightly different light, perhaps I've been unfairly influenced by some Ansible playbooks I've seen in real life.


No, Ansible is not good for this. It's good for maintaining clusters of hosts, but its stateless nature makes it close to worthless for infrastructure.


Well I guess I beg to differ. Been creating and managing 10ks of hosts across multiple on-prem datacenters and clouds for some time now.


How do you deal with this with Ansible instead of Terraform?

I'd love to have a single tool, but Ansible's style seems like a poor fit especially compared to Terraform.


Perhaps we have a different view of what infrastructure means? Managing hosts isn't it by my view.


Possibly! Care to share your view?


Great explanation!


> we cannot manually fix some minor mistakes of ours when creating resources because that'll mess up terraform

This is a feature, not a bug. Terraform is a tool to (reproducibly) enshrine your infrastructure in code. 'Minor manual fixes' are often left undocumented and known only to 1-2 people, and suddenly become major problems when your infrastructure comes crashing down and nobody knows why it used to work.


Reproducible infra, gitops, automation and much more.

For me, the biggest thing is: when I go into AWS I struggle to find everything that is intrinsically linked to another resource. Say you have a Lambda: finding which IAM role is linked to it and what permissions it has takes 2 separate tabs, then another for e.g. security groups, and probably more tabs for other things. While using the aws-cli makes it slightly easier, it's still a lot of effort to do this effectively.

With terraform I can look in one repo that has all the above, often in the same file too. Finding out what your infra looks like is a lot easier.

Regarding the state, you should not be touching your infra outside your code, if you do (e.g. while you're testing in dev), you should make the same changes in tf once you've confirmed it's what you want, and otherwise you undo those changes.

With further automation (e.g. tfcloud) you can even enforce these things by auto applying workspaces which ensures manual changes are always undone.


> we cannot manually fix some minor mistakes of ours when creating resources because that'll mess up terraform

In my opinion if you're doing manual fixes you're doing it wrong. Let's say you do your manual fix in your Dev environment. Do you remember to do it in Prod/whatever other environments you have? Are you sure you did the EXACT same thing? Did you change 5 other things trying to fix it first?

You end up with so many different deployment environments that are unique 'snowflakes', and when something breaks in one it might not affect the others cause they're in totally different states.

It's a nightmare.

In my opinion, infrastructure as code is the only way to do it in a serious environment.


I think there's a middle ground if you're not sure how to fix a mistake in Terraform but you know how to do it in the console:

* Make your changes by hand

* Right afterward, run "terraform plan" to see how Terraform would undo your changes

* Edit your Terraform config to reflect those changes, and run "terraform plan" again to make sure you caught everything. Repeat until it's a no-op.

Now you've got a log of what you've done in a Git-ready format, and you can repeat it elsewhere, and you've learned how to make that console change in code.


This. I’m surprised at how many folks don’t realize you can do this and capture your changes in terraform by looking at the plan and making tf code changes until the plan doesn’t show a diff


You can also fetch cloud resource state with terraform, without running plan - I can’t remember the exact command. You can use this to import new resources into tf


You can do it that way, but I find the tf docs easier and more concise to use than clicking around the AWS UI.


I do too, but that was an invaluable tool when we were first switching over and learning the ropes.


Terraform allows us to implement development practices into our sysadmin lives. Such as code reviews, etc.

For example, at my work this is what i do to apply changes to our AWS setup:

1. Fetch the latest version of our git repo.

2. Create a new git branch named after the Jira ticket I'm working on.

3. Solve the jira ticket by modifying the terraform code accordingly.

4. Submit a pull request and assign one of my colleagues as reviewer.

5. They review my solution, tell me to correct some issues that there might be, or straight up approves my solution.

6. My PR is merged into master.

7. I download the latest master version and apply the codebase.

This way we always have at least two people verify any changes to our infrastructure, minimizing the risk of fuckups and ensures solutions are as good as possible.


Reading the sister comments, I kind of understand the appeal of terraform for huge/multi cloud infra systems.

Now, managing changes in code doesn't look too far from dealing with Kubernetes' json/yml config and applying it to the current cluster, though it would be trickier when expanding to multiple clusters or doing complex orchestration.

I guess TF makes a lot more sense for on-premise, bare metal VMs?


Also cloud hosted services. E.g., terraform knows how to manage hundreds of AWS resource types (IAM roles, DynamoDB tables, S3 buckets), sometimes even before CloudFormation does.

There seems to be some kind of https://github.com/aws-controllers-k8s/community in early beta, but I don’t know how much it supports so far or whether it can diff a resource and modify it in-place.


I think the right way is to use TF to provision the kubernetes cluster and underlying resources such as storage connections, dns, etc, and then use kubernetes to deploy the app.


Use Terraform for bringing up the Kubernetes cluster and other related cloud resources (storage, load balancers, DNS, etc).

Once you have a cluster up and running you can just use kubernetes yml to manage the cluster.

It is possible to also use Terraform to manipulate kubernetes resources (or AWS ECS or another container platform) but I personally like a clear separation between infrastructure (bringing up the environment using Terraform) and scheduling work on a cluster (using kubectl)


So TF would come where you would otherwise use gcloud commands, if I get it.

One part of the cluster building that caught me off-guard with GCP is that cluster options effectively change over time, as the k8s versions march forward.

For instance, some features go out of beta and become the default, and trying to rebuild a cluster exactly like one built a year ago would require specific "don't use that new feature" flags added to it. Or using these new features requires adapting the other resources to the new configurations.

I guess TF is still useful in that it helps audit and reproduce the same infra inbetween changes, and will be nicer as an interface than bash scripts, even if it won’t save from dealing with nitty gritty of the platform.


Indeed, I see TF mostly as an automation/idempotency tool for setting up and maintaining infrastructure components like Kubernetes clusters, ECS clusters, load balancers and all that. Something you would normally do with gcloud or aws-cli or by clicking around in GCP/AWS web console.

Yes, the "provider churn" is real: defaults on cloud platforms can change and platform-specific TF providers can change as well. The way I usually deal with that is to make sure my TF repo's stays in shape by applying them regularly. If you haven't applied a TF repo in over a year chances are real that your repo has rotted. In a similar way to an old Ansible playbook that has rotted because some apt packages have new dependencies. Keep applying them regularly and manage changes in small chunks instead of once a year.

Auditing is really useful. I would recommend preventing developers from applying TF to production directly, and have it all managed by a CI pipeline.

Also, with some careful planning and structuring of your TF repository, it's pretty straightforward to duplicate environments: for example, spin up an extra development environment to experiment with a newer K8s version, or to validate some infrastructure changes.


> I guess TF makes a lot more sense for on-premise, bare metal VMs ?

No, not necessarily. If you have an API for orchestrating your on-premise infrastructure (openstack etc), then sure.

Where terraform shines is managing resources in your cloud provider.

I.e. creating the GCP project and GKE cluster that you need before you can apply your k8s YAML. Or creating the cloudsql databases, GCS buckets, etc. that the apps running in your k8s cluster need.
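For example, a minimal sketch of that first step with the google provider (the names here are made up, purely illustrative):

    # Create the GKE cluster that your k8s YAML will later be applied to.
    resource "google_container_cluster" "primary" {
      name               = "example-cluster"
      location           = "us-central1"
      initial_node_count = 1
    }

    # A bucket that apps running in the cluster will use.
    resource "google_storage_bucket" "assets" {
      name     = "example-assets-bucket"
      location = "US"
    }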


> I guess TF makes a lot more sense for on-premise, bare metal VMs ?

Sure. And also for Kubernetes json/yml. You don't just create infrastructure; it will also change -> delete/create.

And once you establish a process for deleting your Kubernetes json/yml files with some form of code review, or some other way of knowing what is actually deployed, you end up with a tool like terraform.


"I just spent a couple of hours standing up a specific resource, which was kind of a pain in the neck. Oh, now I need 23 more of them! That'll take another 30 seconds to generate the configuration and create the new resources."

That's why I like Terraform.

Edit: "Also, I created a thing 2 years ago and now I need another one. Oh, here it is in Terraform." It's incredibly nice not to have to re-learn how to make that thing, and remember what its quirks and dependencies are.

In short, I consider Terraform exactly like I consider shell scripts: it's not perfect, not by a long shot, but once you have it working it tends to stay working and you don't have to invent it again the next time you need to do something.
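Concretely, the "23 more of them" case usually comes down to the count meta-argument. A rough sketch (the resource type and names here are only illustrative):

    # Stamp out 24 near-identical queues instead of hand-creating each one.
    resource "aws_sqs_queue" "worker" {
      count = 24

      name                      = "worker-queue-${count.index}"
      message_retention_seconds = 86400

      tags = {
        Team = "platform"
      }
    }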


> "Also, I created a thing 2 years ago and now I need another one."

Let me add, this thing was changed 2 years ago to add an option...why? Git blame to the rescue...

I probably use git history more in my tf code than in my regular programming.


Absolutely invaluable to have this, especially for infrastructure changes which often involve ugly hacks to get around limitations but which may just not be clear at first.


Because the state exists regardless. When you write code that interfaces with AWS, there's always something that exists on the other side of the AWS API. The question you have is, how do you programmatically keep a copy of that state on your side of the API?

The naive approach that tries to do this without state in code goes something like:

a) invoke the remote API to look for something that should exist
b) if it doesn't exist, invoke the remote API to create it
c) use the property of the thing to do something else

What Terraform does is cleanly separate your state (don't look stuff up you already know exists) from the API (which is handled by an open-source library, aka the provider) from your configuration (which can now be declarative). Because the state and API calls are abstracted away, the configuration itself is much more clean, easy to reason about, and easier to maintain.
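A minimal sketch of what that separation looks like in config (the bucket name is made up):

    # Declarative: no "look it up, create it if missing" branching in the config.
    resource "aws_s3_bucket" "logs" {
      bucket = "example-logs-bucket"
    }

    # Other config references its properties; the state file and the AWS
    # provider handle the existence checks and API calls behind the scenes.
    output "logs_bucket_arn" {
      value = aws_s3_bucket.logs.arn
    }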


Maybe an unpopular opinion but what you just described as "naive" is arguably a better solution than Terraform's overengineering.

I use Ansible to manage multiple clouds (Openstack, AWS...etc) using a mix of custom modules and public collections. I don't need a "state", I couldn't care less if resources exist or not, upgrades between versions are smooth, module/collection upgrades don't interfere with all the existing resources we already have... every time I run a playbook I know that everything will end up just like I want it to be. Not bad for a naive approach I guess.


If you have an ansible playbook that creates a certain resource and you delete that code, the next time you run it, it won't delete the resource, because there is no state management.

You have to add code just to make sure the no-longer-needed resource gets removed. And how long does that code need to stay there?

Ansible is supposed to engender a declarative approach, but it’s very easy to slip into procedural code. Whereas terraform is much more declarative.


Who cares if there are dangling dns records somewhere or an extra allocated floating ip? In practice you could just set state:absent on whatever you are trying to remove, or just remove it manually; the latter is most of the time faster than dealing with state management once you have a behemoth in prod that no one wants to break.


> Who cares if there is a dangling dns records somewhere or an extra allocated floating ip

What if instead of a dangling dns record, its 15 large EC2 instances?

Yes, you can come up with examples of trivial dangling resources, but it's just as easy for me to come up with non-trivial examples of dangling resources.


I came up with trivial examples because no one forgets about non-trivial resources. In my opinion, if you decrease some instance count from 18 to 3, you'd rather waste 1 minute deleting 15 instances than deal with all the problems state management brings to the table.


> no one forgets about non-trivial resources

The number of articles I've read about someone who left a non-trivial number of resources running unused in AWS and were later surprised by a large bill would seem to be a counterexample to that point.


> Who cares if there is a dangling dns records somewhere

You should care. This opens you up to subdomain takeovers, which have real security implications.

https://developer.mozilla.org/en-US/docs/Web/Security/Subdom...


Some cloud resources will cost money every month, forever. (I think this is an unappreciated side of the AWS business model; it’s not cost-effective to have a dev confirm that each resource can be safely decomm’d.)

There’s also a risk that your legacy environments only work because some dangling resource wasn’t cleaned up, and a new clone of the environment will fail.


> In practice you could just set state:absent to whatever you are trying to remove

If you do this, or in fact anything with Ansible, be REAL careful about double-checking what your tags actually match before committing. Since it doesn't track state, anything in your cloud environment is fair game.

I was not careful once, and that was a bad week for me.


Part of the benefit of Terraform is the ability to set up ephemeral resources and tear everything down afterwards with "terraform destroy", which is useful for setting up one-off experiments and tests. That kind of cleanup is completely impossible with Ansible.


Not true, just set state to absent and run your play again.


You pay for some of these (like dangling IP addresses not in use) and some others have a max quota (like security groups).


Any dangling resources may cost money and/or open up security concerns.


I haven't used Ansible so maybe I'm incorrect here but aren't tf and Ansible solving slightly different problems?

Terraform feels like infrastructure management to me. We use it to provision underlying resources: Networking, Clusters, Nodes, Alerts, etc. All of the actual code deployments are entirely separate.

Ansible is more of a configuration management tool, right?


Right. And you can use something like terraform-inventory[0] as a dynamic inventory source in Ansible. So TF manages all the bits and bobs floating around in AWS, and then ansible takes over to manage configuration on whatever EC2 instances are involved.

[0]: https://github.com/adammck/terraform-inventory


Not really. In terms of functionality, I consider ansible a superset of terraform, without all the state management stuff. Distributed systems are hard, and I will just let my cloud providers be the single source of truth for all the state.

Here's a comment from the author of ansible when terraform was first released: https://news.ycombinator.com/item?id=8100036

> One of the things shown in the Ansible examples are how to do a cloud deploy in one hop, i.e. request resources and also configure the stack all the way to the end, from one button press, and can also be used to orchestrate the rolling updates of those machines, working with the cloud load balancers and so on, throughout their entire life cycle -- all using just the one tool.


I've never used them but there are modules to provision stuff on AWS: https://docs.ansible.com/ansible/latest/scenario_guides/guid...

Edit: seems limited to ec2 and s3: https://docs.ansible.com/ansible/latest/collections/amazon/a...

Although you could provision other services with some custom modules using the aws-cli.


Yeah. I use packer and ansible to build/configure AMIs. Terraform manages the configuration that launches said AMIs (through things like autoscaling, etc...).


> upgrades between versions are smooth

I use tf and ansible regularly. I wouldn't call ansible upgrades exactly smooth, they deprecate features just like anyone else.


How do you ensure you have say 2 web servers created and connected to a load balancer? Is that part of your custom module?


Using ec2_instance_info to check if the instance exists, filtering by name (e.g. selectattr('tags.Name', 'defined') | selectattr('tags.Name', 'equalto', server_name)), and then the standard ec2_instance module; same for the lb.


To answer your questions: you generally look in the Terraform docs, which are well written and always up to date because they're autogenerated. The state forces you, and especially the team, to almost never touch things directly. And if you do, you feel nasty for it.


What else to use? All other tools operate at the same level as terraform, be it cloudformation or anything else. It's just drivers for the cloud API in question, each with their own drawbacks, idiosyncrasies, limitations and workarounds. In a sense, these are all equal effort for the user.


Ansible? Compared to terraform, it should require less effort from the user, without having to worry about states. Compared to aws cloudformation or gcp cloud deployment manager, it should require less effort as well, without having to learn the different idiosyncrasies of these proprietary tools.

> It's just drivers for the cloud API in question

Exactly, and if you are on gcp, both the ansible modules and the terraform modules are even generated from one code base: https://github.com/GoogleCloudPlatform/magic-modules


I used ansible for AWS a few years ago - it's terrible. Not by design, just that most of the modules are buggy and incomplete.


A manual fix requires a git commit, as most use Terraform with GitOps and IaC.

The point of using TF for many is reproducible infra and change approval workflows. So avoiding manual changes via the web consoles is what people are striving for.


I spent a lot of time in both sets of docs for sure (along with perusing the console and creating test resources manually to see what options and switches I might be overlooking).

I think tf really shines when you start using multiple providers to manage things outside of the cloud though. In my last gig, for example, I was using the workspaces feature and the auth0 provider to have separate auth stacks for our different envs; being able to use values created by one provider's resources in another's was nifty.


There are docs for terraform which are fairly general and then provider specific docs as well (aws/gcp/azure/etc). We like it because it is easy to recreate environments and entire infrastructures. It also makes it much easier to review infrastructure changes.

>manually fix

You can if you need to. TF can ignore changes to certain aspects of resources, like the desired count of ECS tasks which might autoscale up and down, and you can always make the change manually and then update the code to reflect it. That's a bit of a no-no though. Like pushing code without a review.
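For instance, a rough sketch of that ignore-changes behavior via the lifecycle block (the service and cluster names are hypothetical):

    # Let the autoscaler own desired_count so Terraform stops treating
    # normal scaling activity as drift to be corrected.
    resource "aws_ecs_service" "web" {
      name            = "web"
      cluster         = "example-cluster"
      task_definition = "web:1"
      desired_count   = 2

      lifecycle {
        ignore_changes = [desired_count]
      }
    }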


Well... I'm puzzled about it on AWS. I don't know about Azure, but on GCP I tried their equivalent of CloudFormation and it was an absolute POS. Reported a bug in their repo and got a response acknowledging the bug, but it's in a specific component's API that is owned by another team (within Google of course) and there's not much they can do about it.

Like WTF? Do they think I have a better chance of reaching that team than they do? I then tried TF and it was much simpler and everything worked as expected.

On AWS though, seriously, TF is inferior.


> we cannot manually fix some minor mistakes of ours when creating resources because that'll mess up terraform

I find this to be the best advantage of Terraform. These small manual changes make your infrastructure harder to reproduce in the future; and unless you have strong documentation rules, you lose track of what is actually in your infrastructure.


> we cannot manually fix some minor mistakes of ours when creating resources because that'll mess up terraform

I had an old boss who fixed minor mistakes in stored procedures on the live server and then wondered why they broke again when builds went out...


You can definitely manually fix something and then just update the terraform code, and it should be fine. But it depends on what kind of manual fix you are doing, as some changes are applied in-place while some require recreation.


You are not supposed to do anything manually ever in a cloud environment managed by a declarative tool. That's not just for terraform but for all of them.


You absolutely can manually fix and sync state:

`terraform refresh` does this.

(Disclaimer: I am an ex-core-maintainer of Terraform)


Awesome stuff. Congratulations on the release. Time to update the Ansible provisioner (https://github.com/radekg/terraform-provisioner-ansible) this coming weekend!


Looks like Terraform is really approaching maturity and the 1.0 release already on the horizon is a deserved milestone. I really appreciated the lockfile for modules/providers added in 0.14 and 3rd party providers in the registry being promoted to 1st class citizens. It enables saner and more modular architectures in an easy enough way.


One of my biggest annoyances with Terraform is how you cannot start out by creating your state remotely - you have to run it first with local state and then move it to a remote backend on a subsequent run.
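For anyone who hasn't hit this: the backend is declared in config, but the bucket (and lock table) it points to has to exist before `terraform init` can use it, which is exactly the chicken-and-egg problem. A sketch with made-up names:

    terraform {
      backend "s3" {
        bucket         = "example-tf-state"       # must already exist
        key            = "prod/terraform.tfstate"
        region         = "us-east-1"
        dynamodb_table = "example-tf-locks"       # optional, for state locking
      }
    }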


Completely agree! My company has pretty strict rules for creating infrastructure in our upper environments. Devs are not able to do it and we want everything to run through our CI/CD pipeline. We either have to ask our cloud team to create the bucket or hack up our build file to add it.

It would be nice, on the first run of terraform init, if it would check for the remote state bucket and then ask if you would like to create it if it does not exist.


I have been using terraform professionally at my current workplace for a while now and I would not recommend it to anyone when building and maintaining non-trivial infrastructure.

- state management is a sad joke. Which would be fine if it wouldn't end up being an expensive joke sometimes. And by sometimes I mean right when you need to change production. Try destroying your stack with an AWS KMS key. Then create it again. Try killing apply and then try re-applying it. I have witnessed hours of developer time going down the drain to figure out what terraform was able to do, import resources, convince terraform that resources are added, write scripts to delete stuff... A complete nightmare. Oh yeah, but just run apply twice, that usually fixes it. There is no such thing as a rollback in the world of terraform. We only roll forward!

- bUt ItS cLouD AgNoSTic. No, it's not. You're writing separate stacks with their separate own custom resources and their separate requirements and separate policies. Just use cloudformation, gdm or whatever your cloud uses and automate these using a script if you're doing crosscloud crap.

- but I want to avoid vendor lock-in. See above, odds are that you're already locked-in but still in denial.

- HCL. I get it. Declarative, nice and clean. HAHAHA, who am I kidding. It sucks, it's stupid and pointless. Awful to write, awful to read, inconsistent, no proper tooling or editor support. Good luck maintaining dependencies or comprehending a complex stack. Why didn't they just build a DSL over a language people already know? Good luck 'refactoring' your resource definitions. Oh, and you also want to test it somehow without affecting a remote environment? Again, good luck, sucker. Because 'the plan' is a lie. And error messages seem to be written by someone waging war against their customers.

In conclusion, I do not recommend terraform. In my opinion it is an overhyped piece of technology on which I would not bet my company/money.


For various reasons, I cannot use terraform stock. I have to use terragrunt, which wraps around terraform and provides a lot of functionality that terraform should have had.


All this energy and griping by people. Don’t get it.

I’m excited for actual obfuscation. I’ve always thought tf cloud is probably the juiciest hacking target of all. All those keys.


I think TF is used a lot in production for a pre-1.0 "we can break whatever we want" kind of product. But then I also use it, happily :)


Terraform (core) doesn’t follow semantic versioning, so pre-1.0 does not mean that. Ever since 0.4, the number of production users made working on it rather like changing the wheels on a bus while driving down the highway full of passengers!


That's what I mean: it is de facto 1.0 already, just not communicated.

And semver or not: pre-1.0 is pre-1.0.

I was glad to read this announcement that they are making strides to get to 1.0.


I do not like terraform and I believe it's a terrible tool.

Here is the error you get when it fails:

Terraform does not automatically rollback in the face of errors. Instead, your Terraform state file has been partially updated with any resources that successfully completed. Please address the error above and apply again to incrementally change your infrastructure.

If that error does not give you pause, I don't know what will. It's basically YOLO-ing it, and in case it does not work you're on your own. It also leaves you in whatever state it got to, and now you need to figure out what it managed to do, what didn't work, and then figure out how to 1) make it work and 2) make terraform understand that it worked. Usually this happens at the worst time, when you are deploying to production.

Do we want to talk about when plan "works" but it craps out on "apply"? Or about when it just loses track of resources?

HCL? A disaster. Why would you need to invent your own language? Why oh why? Hashicorp also knows better. Remember Vagrant? Vagrant got it right with the DSL + you could always drop down to Ruby if you did something not covered by the DSL.

My advice to you: use whatever the cloud you are using has built in (cloudformation, deployment manager, etc).


> use whatever the cloud you are using has built in (cloudformation, deployment manager, etc).

Not applicable if you're deploying infra across clouds (we do that across a dozen cloud providers!). Without terraform, it would bring us to tears.

> Or about when it just loses tracks of resources.

Never had this in years of using terraform. These should be qualified with how often they occur.

> Terraform does not automatically rollback in the face of errors. Instead, your Terraform state file has been partially updated with any resources that successfully completed. Please address the error above and apply again to incrementally change your infrastructure.

Check out the code from before the change and apply again. I've done this numerous times and it's been as good as a natively supported rollback.

> HCL? A disaster.

Why is it a disaster? I find it predictable, ergonomic and easy to write. I also really like the JSON compatibility. But this one is I guess mostly taste.

> it's a terrible tool

I've found Cloudformation to be an abomination as far as user experience goes! Every time I used it, I lost a bit of my soul.


> Check out the code from before the change and apply again. I've done this numerous times and it's been as good as a natively supported rollback.

Almost as good as native but not quite. Oftentimes applies fail because you've introduced new providers with new resources. If you check out the previous code and apply again, then TF will explode because there are resources in the statefile that are now missing providers. So you have to manually patch in the provider from the new version on top of the old version just to remove the resources.


nah. i’ve heard the cross-cloud argument multiple times and it does not hold any water. i wish we lived in a world where we could do cloud agnostic stuff but the reality is that you have your aws terraform files, your gcp terraform files and so on.

Cloudformation? Like the tool that brings you from state A to state B or, if it cannot, rolls back to state A? Yeah. What an abomination, a tool that leaves infra in a consistent state. Who would want that?

Re: checkout the code and apply again. What if I told you that terraform can mess up so hard that it won’t work? What if I told you that sometimes you will not be able to rollback OR even destroy your infra?

Here is an exercise for you. Create some infra with terraform, do a destroy, and tell me if it managed to clean up everything. I’m gonna bet you that for anything less than trivial you are gonna have a bad time. The solution? Tag shit and use a python script to clean up. I’m not even kidding.


We use it all the time. Not to deploy to multiple "cloud providers", but to update dnsimple domains with the cloudfront endpoint, set hirefire rules for heroku dynos, and create kapacitor alerts. It's great that they can all reference values from other services, and it's broken up by coherent services, rather than where it's being hosted.


> nah. i’ve heard the cross-cloud argument multiple times and it does not hold any water.

You misunderstood. I'm talking a single terraform project provisioning resources across different providers.

We have AWS resources referencing stuff in Cloudflare and the other way around, and Cloudflare referencing stuff in a k8s cluster running on-site. Add some GCP LBs referencing a whole bunch of endpoints across 6 providers. They're all under a single TF project and a single tf apply makes changes across all of them.
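For a flavor of what that looks like, a sketch with made-up names (the two variables are assumed to be defined elsewhere):

    # The AWS provider owns the load balancer...
    resource "aws_lb" "api" {
      name               = "api"
      load_balancer_type = "application"
      subnets            = var.public_subnet_ids
    }

    # ...while the Cloudflare provider points DNS at it, in the same apply.
    resource "cloudflare_record" "api" {
      zone_id = var.cloudflare_zone_id
      name    = "api"
      type    = "CNAME"
      value   = aws_lb.api.dns_name
    }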


>nah. i’ve heard the cross-cloud argument multiple times and it does not hold any water.

This. I use terraform every day at my work and I really like it compared to cloudformation, but I never understood the cross-cloud advertisement. If I deploy, say, an RDS cluster and Kinesis streams with terraform, how on earth is that cross platform?


See my other comment. It's not about a single resource working across providers, but the ability to reference resources across providers and linking them together. Think Cloudflare pointing to AWS endpoints.


I love your comment, "YOLO-ing" gave me a laugh, and a laugh at myself in some previously stressful/complex deployment periods. Terraform was not a friend there.

That said, I've worked on/with a team enabling some fairly advanced & streamlined capabilities across all three clouds because of Terraform. The abstraction matters, and the ability to model scaled platform components and architectures as singular units is important. The open source community is great.


is the open source community great? Look at the open bugs. Some of them are old enough to go to school.

For extra lolz try killing the terraform process or pulling the network cable. Hilarity ensues and you’re gonna have a bad time recovering what it was doing.

I would not recommend Terraform period.


There was a bug fixed a few months ago. Only recently did destroying the infrastructure start to undo what was actually deployed; previously it used the current hcl files.


So purposely breaking state means it breaks, gotcha.


Have you never had a failure with CloudFormation? Because when that happens you can’t see what’s going on and you often can’t get out of the situation without engaging support.

More often than not with Terraform, another apply will fix the problems you are complaining about. But if it doesn’t at least you have the power to fix it unlike with CloudFormation.

But ultimately whatever tools you use you will have trouble if you are unwilling to engage with them and learn how they work.


I'm gonna share something shocking here: I have never seen Cloudformation crap out in a manner that was not consistent and clear. Usually it will either try until it succeeds or fail in a clear and predictable manner.

(I have been using AWS since forever, and both Cloudformation and Terraform (sad panda) for years. I will preach the shortcomings of terraform to everyone who listens. I have rarely seen developers using it for something non-trivial and being happy about it.)


Rollback is one reason I chose AWS Cloudformation over Terraform. Also, I do not see the need for Terraform in my use case, as we run only on AWS. No point in not using AWS native tools.


When using Terraform you should think of your IaC as the plans to build or rebuild your infrastructure, not an extension. I built mine in a way where I can destroy all the infrastructure in a plan and then reapply. You don't have to worry about rollbacks when you can just rebuild everything without affecting your end service.


welcome to the real world where you sometimes cannot do what you are saying :(


What the parent meant is that in CF, if it encounters an error, by default it will roll back to the state before the update. TF will just leave it broken.


It isn't left broken per se; in fact you should be able to version the state, roll back to the previous state, and have it remove whatever new resources were created when you re-apply.

I just take another approach which is if something doesn't work, then I'm able to quickly blow it away and re-deploy without affecting the end service by having the apply only target specific environments.


If it is half deployed it is broken; maybe in some cases the service continues running, but in others it will be broken.

The solution you mentioned relies on other tools and some manual work on your part. Also, with how TF works by default, you might never be able to restore to the exact same state as before.

When CF needs to replace a resource, it first creates a new copy and applies all changes; if everything succeeds it removes the old resource, and if it fails it rolls the changes back.

You have an option that allows you to do that with a resource, but it's done individually per resource and it still isn't exactly the same thing.
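Presumably that per-resource option is the create_before_destroy lifecycle setting; a rough sketch (the AMI variable is hypothetical):

    resource "aws_launch_configuration" "app" {
      name_prefix   = "app-"
      image_id      = var.ami_id
      instance_type = "t3.micro"

      lifecycle {
        # Create the replacement before destroying the old resource,
        # roughly mirroring CF's default replacement order.
        create_before_destroy = true
      }
    }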


I don't use CF, but how does it handle rolling back a delete (thus recreating a resource) and resource ARNs? Is it using some functionality not available to other tools?

I always found tf’s lack of rollback ability to basically be an admission that there is no way to really roll back to the exact state, ARNs and all, so it isn’t going to try.


>Is it using some functionality not available to other tools?

No, CloudFormation just postpones resource deletions towards the end of stack operations to allow successful operation rollbacks. Even the default replacement strategy is creating the new resource before deleting the old resource whenever possible for this reason:

https://github.com/aws-cloudformation/aws-cloudformation-res...


Cloudformation rollback is not exactly bulletproof either.


Especially initial deploys that roll back, where you have to delete the stack to try to deploy again. Waste of time.

On the whole terraform is much faster too.


CF gives you both behaviors; you have an option to select which one you prefer.



I don't use it, so I don't know; if it is a missing feature then request it.

We are talking here about CF and CF gives you control over it.


here is a hint for you: you can tell cloudformation to nuke everything in case it fails.



sam uses cloudformation under the hood. so you can invoke cloudformation directly or demand this is added here: https://github.com/aws/aws-sam-cli/issues/2191


you dodged a bullet imho.


terraform was a great idea but the language makes devops folks have to do a lot of extra work, purely due to language design features. Terragrunt helps with a lot of that, but combined with the version churn, long-standing bugs, and more, I really wanted an alternative that was small and simple.

Also: why isn't the state file created at command invocation by querying AWS for the state? Or at least have an option to generate the simplest TF for your current AWS state.



Does this basically automate a "terraform import" of existing infrastructure then?


Our organization just started looking at Crossplane instead of Terraform. It apparently gets around the lock issues [1] and is better suited for K8S environments. Some folks I work with think the community will eventually adopt Crossplane over Terraform. Anyone have any experience with it?

[1] https://blog.crossplane.io/crossplane-vs-terraform/


Blows my mind how many people are basing business critical infrastructures on 0.something of anything


Terraform (and hashicorp) don’t follow semver. Terraform has been spinning up infrastructure for small and $100billion+ companies for years, it came out in 2014.

Business critical readiness is not measured by whether something has or hasn’t got a “1.0” tag on GitHub.


and in 2021 it is finally able to remove the environment that was actually deployed and not the one located in your current directory.

I wouldn't call TF garbage, because real garbage is the native provisioning tool gcp has. It is just a waaay over-hyped tool.

Reading the comments here, it is clear that the biggest proponents have at most used CF once, or not at all.

TF was great when it was released, it was superior to CF back then. HCL was much better than using JSON, but things have improved dramatically. No need to worry about storing state, imports/exports let you share values between stacks, and there are CustomResources (where you can create a lambda function that does anything you need to). I used one, for example, to configure the AWS integration with DataDog, where it needs to create a resource in AWS, then do setup on the DD side that creates more resources. Combine that with managed StackSets and BOOM, now every account in the organization automatically sets up the integration. Imagine doing something like that with TF.



