Not sure about that. It’s a clever trick alright but it adds a possibly unnecessary layer of leaky abstraction and makes code reuse more difficult when working with “vanilla” Terraform. If you really want to use a full programming language with infra as code, I’d suggest looking into Pulumi. Or, if you want to (or need to) stick to Terraform, try the official CDK.
I have been saying this for a while now -- as much as I respect HashiCorp and admire their engineering, HCL is the biggest problem with the tools that use it, the biggest of those being Terraform.
Don't use a DSL where a full programming language is what you need. I can see that they probably wanted to be able to skip writing two or three language-specific CDKs out of the box, and DSLs probably hit on the need for simplicity really early, but infrastructure code is somewhere you really want to be able to drop into the full expressive power of a programming language.
I use Pulumi[0] for my projects, and while I doubt people will be able to get it as much adoption as I think it should have in corporate environments, Terraform has introduced the CDK[1] which is a similar approach. Have a talk with your engineers about complexity (and how not to strangle the rest of the team with it) and stop using DSLs in places you need more expressive power.
I think a good indication of when you need to stop using DSLs -- or when you should realize a DSL wasn't the right way to go -- is when you start doing things like writing control flow. You can get away with DSLs if you're using s-expressions, but that is the exception that proves the rule (because there happens to be a family of languages that treat s-expressions as syntax).
The moment you need control flow to define your resources, I'd argue that you're verging away from the realm of declarative infrastructure.
I'm using Terraform to manage 10^4 machines in combination with sane CI/CD, Bash/JQ (for dealing with Terraform outputs), Packer and Ansible. Every time I see somebody reaching for a full programming language to define their infrastructure, they seem to be doing too much with one tool.
Terraform should merely provision things and in that role I find it fine as is. Preferred, even.
It’s not flow control so much as sane, expression-based generation of resources. Terraform has been evolving toward the dynamic with features like for_each, but these features have awful ergonomics compared to something like list comprehensions. Similarly, sometimes you want to reuse some pattern, but the unit of reuse in Terraform is the module, which involves a lot of ceremony, so you don’t reach for it as often as you would if the unit of reuse were a simple function definition.
I don’t especially care whether Terraform is a simple static language for generating resources and the DRY-ness comes from an exterior Python/etc script that generates the Terraform files, or whether the requisite dynamism is built into the Terraform language itself, but make no mistake: the dynamism is absolutely essential for maintainable Terraform.
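To make the "exterior script" option concrete, here's a rough sketch (resource names and attributes are illustrative, not from the original comment) of a plain Python script emitting Terraform's JSON configuration syntax (`*.tf.json`, which Terraform reads natively), where the reusable pattern is just a function and the repetition is a comprehension:

```python
import json

# Illustrative only: one DynamoDB table per region, generated with a
# comprehension and an ordinary function -- the kind of reuse that
# HCL's for_each makes comparatively clumsy.
regions = ["us-east-1", "eu-west-1", "ap-southeast-2"]

def table(region):
    # The shared pattern, factored out as a plain function.
    return {
        "name": f"events-{region}",
        "billing_mode": "PAY_PER_REQUEST",
        "hash_key": "id",
        "attribute": [{"name": "id", "type": "S"}],
        "tags": {"region": region},
    }

config = {
    "resource": {
        "aws_dynamodb_table": {
            f"events_{r.replace('-', '_')}": table(r) for r in regions
        }
    }
}

# Terraform picks this up alongside regular .tf files.
with open("tables.tf.json", "w") as f:
    json.dump(config, f, indent=2)
```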
I really dislike the idea of declarative infrastructure. It's literally a program that is designed to do one thing, but will change a million things in order to do that one thing. It's Configuration Management for Infrastructure. Yet so many people have this idea that it's something else, like it's supposedly simpler or more effective.
Saying I want an S3 bucket named "foo" is the same as running the single imperative command that creates it.
Did I need a big fat declarative infrastructure system to make that? No. But people want more complexity and features, and they want to make it look simple. So they write big fat applications and libraries to do that. The idea that there's some inherently superior difference of "declarative programming" over "regular programming" is giving people the idea that wrapping everything up in interfaces somehow removes the complexity, or somehow ends up in a better program.
This example is really simple -- it gets more complicated when you want to check things that don't serialize perfectly to strings you can easily grep for.
Once you start writing complex scripts you have a choice -- you either do it imperatively, or declaratively. Eventually you'd come to the fact that it doesn't make sense to just... run imperative commands when you can't guarantee that the other end is idempotent, so you'd arrive at:
- (Optionally) take a lock on performing changes
- Check existing state
- Perform your changes based on existing state
- (Optionally) release your change lock
And voila, we're at complexity. I'd argue that this complexity is essential, not accidental, given the goal of making an easy-to-use system that ensures state.
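A minimal sketch of those four steps, with a plain dict standing in for the real cloud API (everything here is hypothetical -- the point is only the check-then-converge shape):

```python
import threading

# Toy stand-in for a cloud API: resource name -> settings.
actual_state = {"bucket-a": {"versioning": False}}
state_lock = threading.Lock()  # stands in for a distributed change lock

def ensure(desired):
    """Converge actual_state toward desired, returning the changes made."""
    changes = []
    with state_lock:                                  # 1. take a lock
        for name, settings in desired.items():
            current = actual_state.get(name)          # 2. check existing state
            if current is None:
                actual_state[name] = dict(settings)   # 3a. create
                changes.append(f"create {name}")
            elif current != settings:
                actual_state[name] = dict(settings)   # 3b. update
                changes.append(f"update {name}")
            # Unchanged resources are left alone (idempotence).
    return changes                                    # 4. lock released on exit

first = ensure({"bucket-a": {"versioning": True}, "bucket-b": {"versioning": False}})
second = ensure({"bucket-a": {"versioning": True}, "bucket-b": {"versioning": False}})
print(first)   # ['update bucket-a', 'create bucket-b']
print(second)  # [] -- the second run is a no-op
```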
> Once you start writing complex scripts you have a choice -- you either do it imperatively, or declaratively.
I don't think declarative programming exists. I think it's just a regular old program with a poorly defined interface. Moreover, I think the claims of idempotence are overblown to the point of near falsehood.
Declarative Infrastructure is really just Configuration Management applied to cloud infrastructure rather than operating system software. Neither have really solved anything, other than turning the management of complexity into a Sisyphean task. Forever pushing drifting state back up the hill.
Compare this to Immutable Infrastructure, where state never drifts. One never "fixes" a container once deployed, or a package once built and installed. One merely rolls back or upgrades. Any uncertainty is resolved in the build and test process, and in providing both all the dependencies and the execution environment.
I think eventually people will wise up to the fact that Terraform is just puppet for infrastructure. I think the real fix is to make the infrastructure less like an operating system and more like versioned packages. Install everything all at once. If anything changes, reinstall everything. Never allow state change.
SQL is due for replacement. The combination of schema and data in one constantly mutating hodge podge with no atomic immutable versioning or rollback is absolutely ancient. Migrations are an okay hack but definitely not good enough.
ZFS and LVM prove filesystems can do snapshots and restores to a version of filesystem history without a lot of pain, so clearly we just need more work here to make it an everyday thing. Versioning should be the default, and probably also an infinite transaction log, seeing as capacity and performance are ridiculous now.
And couldn't we lock writes, revert a perpetual write journal/transaction log to some previous version, and then continue a new write history tree? If you run out of space, overwrite the old history. If you don't run out of space, allow reverting back.
And allow bulk atomic updates by specifying a method to write files that aren't 'live' until you perform some ioctl, and then atomically expose them and receive the new history version. Then you could do immutable version-controlled storage on a filesystem, right?
Blob/object stores should be much simpler to do the same with. Just an API rather than ioctl.
In this way, replacing a data store immutably will just be replacing a reference to a storage version, the same as using snapshots, but built into the filesystem/API.
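As a toy sketch of that idea (no real filesystem or ioctl involved, just the shape of the API): an append-only journal where a "version" is an index into history, and revert is moving a pointer rather than mutating anything in place.

```python
# Toy sketch of version-controlled storage: every write appends to a
# journal, and a "version" is just an index into it. Reverting moves
# the head pointer; nothing is ever overwritten.
class VersionedStore:
    def __init__(self):
        self.journal = []   # list of (key, value) writes
        self.head = 0       # number of journal entries currently "live"

    def put(self, key, value):
        # Writing after a revert truncates the undone history
        # (a new write-history tree), then appends.
        del self.journal[self.head:]
        self.journal.append((key, value))
        self.head = len(self.journal)
        return self.head    # the new version number

    def get(self, key, version=None):
        upto = self.head if version is None else version
        for k, v in reversed(self.journal[:upto]):
            if k == key:
                return v
        return None

    def revert(self, version):
        self.head = version  # atomic: just move the pointer

s = VersionedStore()
v1 = s.put("schema", "v1")
v2 = s.put("schema", "v2")
s.revert(v1)
print(s.get("schema"))  # v1
```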
Hm, isn’t there still nondeterminism from a path dependency, because reinstalling a datastore that has arbitrary history isn’t exactly equivalent to creating a datastore with none?
> I don't think declarative programming exists. I think it's just a regular old program with a poorly defined interface. Moreover, I think the claims of idempotence are overblown to the point of near falsehood.
I think this really depends on how you define the term "declarative programming" -- pinning down a singular meaning and a singular interpretation is really hard. If we think about it like a spectrum, there's a clear difference between ansible and terraform like there is with python and prolog. That's "declarative" enough for me.
Idempotence is also really tricky and hard -- I'm not surprised most large codebases can't handle it, but getting close is definitely worth something.
> Declarative Infrastructure is really just Configuration Management applied to cloud infrastructure rather than operating system software. Neither have really solved anything, other than turning the management of complexity into a Sisyphean task. Forever pushing drifting state back up the hill.
While I agree on declarative infrastructure being configuration management applied to cloud infra (especially in the literal sense), I would argue that they have solved things. In the 90% case they're just what the doctor ordered compared to writing every Ansible script yourself (or pulling one from Ansible Galaxy) -- and Ansible actually supports provisioning! The thing with this declarative infrastructure push is that it's encouraged the companies themselves to maintain providers (with or without the help of zealous open source committers), so now someone else is writing your ansible script and it has a much better chance of staying up to date.
> Compare this to Immutable Infrastructure, where state never drifts. One never "fixes" a container once deployed, or a package once built and installed. One merely rolls back or upgrades. Any uncertainty is resolved in the build and test process, and in providing both all the dependencies and the execution environment.
People are often using these two concepts in tandem -- the benefits of immutable infrastructure are well known, and I'd argue that declarative infrastructure tools make this easier to pull off not harder (again, because you don't have to write/maintain the script that puts your deb/rpm/vm image/whatever on the right cloud-thing).
> I think eventually people will wise up to the fact that Terraform is just puppet for infrastructure. I think the real fix is to make the infrastructure less like an operating system and more like versioned packages. Install everything all at once. If anything changes, reinstall everything. Never allow state change.
Agreed, but I'm not sure this is very practical, and there's a lot of value in going part of the way. There is a lot of complexity hidden in "reinstall everything" and "never allow state change", and getting there without downtime requires the cooperation of the systems involved most of the time -- and you'll never get away from the fact that some efficiency is lost.
But again, we were talking about the scripts you'll have to write -- in a world that is not yet ready for fully immutable infrastructure, it's just a question of how you write the scripts, not whether an option exists that will prevent you from writing them altogether (because there isn't, and most things are not fully immutable-ready yet).
> there's a clear difference between ansible and terraform
The only difference I can see is that Terraform attempts more of an estimation of what might happen when you apply. Otherwise they're the same.
Terraform has multiple layers of unnecessary complexity which were added with good intention (the belief that you could "plan" changes before applying them) but don't actually work in practice. Your state file never reflects the actual state, so it's pretty much meaningless. The plan step is (in theory) supposed to tell you what will happen before you hit apply. But actually knowing it beforehand is impossible.
Part of that is the fault of the providers that don't do the same validation as the actual APIs you're calling do. But the other part is the fact that the system is mutable; it's always changing, so you can never know what will happen until you pull the trigger. The only way to say "only apply these changes if they will actually work" is to move the logic into the system, turning them into transactions (ala SQL).
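To make that concrete with a toy model (this is not Terraform's actual implementation, just the shape of the problem): a "plan" is a diff against a snapshot of actual state, and the snapshot can go stale before you apply -- a classic time-of-check-to-time-of-use race.

```python
# Toy illustration of why "plan" can't guarantee "apply": the real
# system can change between the two steps.
def plan(desired, actual):
    """Diff desired config against a snapshot of actual state."""
    return {k: v for k, v in desired.items() if actual.get(k) != v}

actual = {"instance_count": 2}
desired = {"instance_count": 3}

p = plan(desired, actual)      # plan says: change instance_count to 3
actual["instance_count"] = 3   # ...someone changes it out of band

# The plan is now stale: applying it is a no-op at best, and in a real
# system it could conflict with whatever else changed underneath it.
print(p)                       # {'instance_count': 3}
print(plan(desired, actual))   # {} -- a fresh plan sees no diff at all
```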
Honestly, the only reason I use Terraform at all is because writing a bunch of scripts is not scalable. With large teams, you have to use some kind of uniform library/tooling to manage changes. Terraform is currently the best "free" option for that, but I don't find Ansible any more or less reliable, it's just more annoying to use. I definitely don't use them for any "declarative" approach they may have. And in fact, for regular app deployments, I actively do not use Terraform/Ansible at all, and instead write deployment scripts that can manage my particular deployment model requirements. I intentionally abandon the "declarative" model because it's so uncertain (and unwieldy).
> The thing with this declarative infrastructure push is that it's encouraged the companies themselves to maintain providers (with or without the help of zealous open source committers), so now someone else is writing your ansible script and it has a much better chance of staying up to date.
I agree with you here, it's very good that companies can invest in supporting a provider so people can benefit from common solutions. I'm not sure that is specific to declarative infrastructure as much as just being more proactive about supporting their users using their services, though. For example, NewRelic didn't have a Terraform provider until one of their customers wrote one, and eventually they took it over. It's still not great (I have to supplement a lot of missing features with custom scripts calling their APIs directly), but it's better than nothing.
Infrastructure should be defined in an easily digestible, human-readable format.
Your manifests serve two purposes: define infrastructure and self document.
While you can achieve the same infrastructure automation with shell scripts, they’re rarely written well enough to easily understand, introducing operational risk when handed off to other people or teams.
Documentation needs to express the intent of the author and how they arrived at a solution, and more importantly why they arrived at a solution. As someone who's had to clean up "self documented" code I can say unequivocally it will be a disaster. A decade from now we will be untangling thousands of lines of some ancient Python library to understand the intent of infrastructure that could have otherwise been properly documented in 5 minutes.
Yes but AWS CLI commands change over time and don't have a native way of maintaining which version of the CLI you use. Also, you have to maintain that knowledge for however many things you have to do across however many providers.
The point of Terraform isn't to add complexity, it's to have a general way of interacting with a vast number of APIs that's effectively the same and to abstract away the tribal knowledge of knowing how each individual API works.
On the same provider version, you generally can expect Terraform to work the same over time (okay this is less true for say Google provider...) as the CLI keeps evolving.
It's still helpful to understand the providers and their CLIs, but Terraform is a substantial force multiplier because of how generic it is across the absurdly long list of APIs that it talks to. That is what its value is.
But it's not generic. I have to track the provider version, the Terraform version, my module version, and any sub-module versions, long-term. Each internal team has to jump through hoop after hoop just to run terraform apply reliably.
I've never yet had to rewrite a shell script that used a new version of AWS CLI. It's very possible that that's only because I've not been using it enough. But even that would be just one level of complexity to manage, rather than four.
And in fact, even within a single provider, interfaces aren't generic at all.
You have to write every single resource of every single provider to be specific to its definition. It would be the same amount of work if you were writing a shell script with curl to plug into each API call. I know, because it was actually faster for me to write a Bash implementation of NewRelic's APIs than to figure out its ridiculous Terraform provider and resources with no documentation.
The only benefit of Terraform is that I don't have to write the API implementation [for most providers]
> The moment you need control flow to define your resources, I'd argue that you're verging away from the realm of declarative infrastructure.
Declarative infrastructure shouldn't be pursued for its own sake -- what I want is efficient, simple-to-manage infrastructure automation. The declarative nature is awesome, but once you start plumbing variables and complexity from one static script to another, the cognitive load of keeping it all in line is better managed with a programming language, in my opinion -- you're just choosing bash/jq/awk/etc instead of a different language.
I think "the way the declarations are made must be static files" is dogmatic, or at least limiting, for me. Yes, it is absolutely the simplest way to view what's present, but the problem is that when someone goes in to change any of this they will be dealing with your bolted-together complexity (even if it's not very complex).
> I'm using Terraform to manage 10^4 machines in combination with sane CI/CD, Bash/JQ (for dealing with Terraform outputs), Packer and Ansible. Every time I see somebody reaching for a full programming language to define their infrastructure, they seem to be doing too much with one tool.
> Terraform should merely provision things and in that role I find it fine as is. Preferred, even.
I can't argue with the efficiency and efficacy of your setup, but I don't think much of this has to do with what we were discussing -- Pulumi does not seek to do the jobs of those other tools -- it's not going to build your VM images or do provisioning (unless you use it that way like with terraform[0]).
Here's a concrete example of a benefit I got from using Pulumi over Terraform recently, in some code working with SES:
import * as fs from "fs";
// ... more imports and other lines (aws, path, etc.; `stack`,
// `emailUser` and `emailUsername` are defined further up, omitted here)
// Email access key
const emailAccessKey = new aws.iam.AccessKey(
    `${stack}-ses-access-key`,
    {user: emailUser.name}
);
export const emailUserSMTPPassword = emailAccessKey.sesSmtpPasswordV4;
export const emailUserSecret = emailAccessKey.encryptedSecret;
// Write the SMTP username and password out to a local secret file
const apiSecretsDir = path.join(__dirname, "secrets", "api", stack);
const smtpUsernameFilePath = path.resolve(path.join(apiSecretsDir, "SES_USERNAME.secret"));
const smtpPasswordFilePath = path.resolve(path.join(apiSecretsDir, "SES_PASSWORD.secret"));
emailAccessKey.sesSmtpPasswordV4.apply(password => {
    // Make sure the secrets directory exists before writing into it
    fs.mkdirSync(apiSecretsDir, { recursive: true });
    console.log(`Writing SES SMTP username to [${smtpUsernameFilePath}]`);
    fs.writeFileSync(smtpUsernameFilePath, emailUsername);
    console.log(`Writing SES SMTP password to [${smtpPasswordFilePath}]`);
    fs.writeFileSync(smtpPasswordFilePath, password);
});
I wanted to write information out to a file... So I just did, and that was it. No need to reach for the stack output later and pipe it anywhere -- any time pulumi runs it will update that variable if/when it changes, and the next tool (which requires the file at that path to be present) will continue on without knowing a thing.
I can't say that this is perfect Pulumi code (e.g. I could have defined a custom Resource to do this for me), but I have saved myself having to do the plumbing with bash scripts and terraform output awk-ing, and the information goes just where I want it (NOTE: the secrets folder is encrypted with git-crypt[1]). When someone comes to this file (ses.ts), they're going to be able to easily trace where these values were generated -- similar with bash scripts, but now they don't have to be a bash/awk/jq master to manipulate the information. There are definitely some gotchas to using Pulumi (like the `.apply` there), but in the end I'd prefer to make changes like this in a consistent language I like (TypeScript).
My toolkit looks very similar to yours, except I basically only use make + kubectl + pulumi + ansible (rarely, because of the kind of servers I rent).
So what's the plumbing that you would have to do? Under the hood, Pulumi is using the Terraform providers...
(I left out username, because I don't see where you're setting emailUsername)
In my pipelines I don't bother writing out to files things that are in terraform state. I just create an output for that state (potentially set to sensitive) and then use that output in my CI/CD. Remote state stays encrypted and without wide access and I don't have to worry about secrets being in files anywhere.
That's where the bash scripts do things with outputs. It could be Python or whatever, it doesn't matter really. But with bash I can easily just set variables to `terraform output -json | jq <select output &/|| do stuff>`.
Mainly all I do is write terraform outputs to vault (i have simple bash automation to do all of this) and then I can use the Vault secrets in other CI/CD pipelines.
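It could equally be a few lines of Python instead of jq -- a rough sketch, assuming the `name -> {value, type, sensitive}` shape that `terraform output -json` emits (the output names here are made up):

```python
import json

# Sample of the JSON that `terraform output -json` produces; in a real
# pipeline you'd capture it with something like:
#   raw = subprocess.run(["terraform", "output", "-json"],
#                        capture_output=True, text=True, check=True).stdout
raw = '''
{
  "smtp_username": {"sensitive": false, "type": "string", "value": "ses-user"},
  "smtp_password": {"sensitive": true,  "type": "string", "value": "s3cret"}
}
'''

# Flatten to a plain name -> value mapping, ready to export or push to Vault.
outputs = {name: entry["value"] for name, entry in json.loads(raw).items()}
print(outputs["smtp_username"])  # ses-user
```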
> You can just write information out to files in Terraform with no stress.
This wasn't the point -- it was that I wanted to do something that I know how to do in a fully featured programming language, and I can "just do it". Writing local files is a very simple example -- unless you're arguing that terraform's capabilities amount to an entire language's ecosystem, this is just the tip of the iceberg.
> So what's the plumbing that you would have to do? Under the hood, Pulumi is using the Terraform providers...
I think I didn't explain it well enough. Pulumi and Terraform are almost the same tool, but the difference is how the pieces are plumbed together. I prefer plumbing with a programming language more than shell scripts, utilities and/or some other programming languages.
Also, Pulumi's system of custom resources which are just pieces of code sitting in your codebase is fantastic and novel (terraform has custom providers but this feels significantly more heavy weight).
> In my pipelines I don't bother writing out to files things that are in terraform state. I just create an output for that state (potentially set to sensitive) and then use that output in my CI/CD. Remote state stays encrypted and without wide access and I don't have to worry about secrets being in files anywhere.
If you're really adjusted to Terraform, and it works great for you, then awesome -- I'm not out to change how you do things. It sounds like you've fully bought in to the terraform way of doing things, and it's working for you, and that's great.
> That's where the bash scripts do things with outputs. It could be Python or whatever, it doesn't matter really. But with bash I can easily just set variables to `terraform output -json | jq <select output &/|| do stuff>`.
Here it is again... My point is that the plumbing matters, and a fully baked programming language offers the possibility of better plumbing. I didn't touch on it much, but just having access to custom resources with Pulumi might be able to cut down the external plumbing to zero, and enable creating more reusable pieces.
> Also, Pulumi's system of custom resources which are just pieces of code sitting in your codebase is fantastic and novel (terraform has custom providers but this feels significantly more heavy weight).
Pulumi literally uses Terraform's custom providers as its dependencies under the hood to make this work.
Moreover, you're entrusting your entire production stack to an ultra-aggressive, hypergrowth, early stage startup...
Your statement about providers as dependencies is inaccurate in a number of respects:
- Pulumi has the ability to use the CRUD logic from Terraform providers to reify resources, but that is one of a number of different approaches it can use.
The Kubernetes and the new Azure Resource Manager providers are instead built from the API specs and additional annotations, and involve no aspects of Terraform.
- (Less importantly, but still:) Terraform does not have the notion of “custom” providers as a technical construct - first- and third-party providers implement the same protocol and have the same capabilities. There are just “providers”.
Disclaimer: I have contributed to Pulumi and now use it day to day at a large company. I also was a core maintainer of Terraform at HashiCorp for several years.
> Pulumi literally uses Terraform's custom providers as its dependencies under the hood to make this work.
Right, but you can see that the interface is easier as a custom resource, though, right? The Pulumi docs on making a custom resource amount to just extending a class. Making a good one is, I'm sure, fraught with peril, but as far as I can see it's so much easier than trying to extend Terraform -- I'm glad that Pulumi has done this for me.
> Moreover, you're entrusting your entire production stack to an ultra-aggressive, hypergrowth, early stage startup...
Source code is Apache 2.0 and available[0]... I'm not against them trying to profit, but they've created a useful thing that is licensed very permissively (they could have gone with BSL or whatever else)...
> but infrastructure code is somewhere you really want to be able to drop into the full expressive power of a programming language.
Please expand on what this expressive power you need is. I see things like Pulumi and I can only think of IaC codebases that end up being Turing tarpits in the hands of developers who don't get what the whole point of the declarative model is, putting a bunch of untestable arbitrary IO and high cyclomatic complexity in the middle of determining what is going to be deployed.
And this is definitely not an argument against having better languages than HCL, Dhall would certainly be a step up that doesn't give the developer an arsenal of footguns to move fast and wreak havoc on the underlying foundations of a business. But I want to know what needs you have that you feel are worth the risk of putting a bull in the china shop.
Oh wow, I had no idea that terraform had implemented a CDK frontend.
That's awesome. I've been writing a bunch of regular CDK code, which is AWS specific, I loathe HCL, and I was a bit disappointed by the quality of pulumi when I evaluated it.
I'll have to check out the Terraform CDK support.
Care to share your thoughts on Pulumi quality? I’m asking since I’m currently looking at their TypeScript implementation to use in prod and would love to know any issues that may pop up.
I didn't really get that far with it, but the first issue I ran into, was that the program wouldn't work at all, just errored out with no logging, even after passing all available debug flags.
Spent hours digging through the code trying to figure out what the issue was, figured it out, reported it, even provided the solution, and the github issue remains open over a year later[0].
The other thing I didn't particularly like about it, is that it doesn't really seem feasible to develop using it, without it having access/credentials to your cloud environment.
The great thing about CDK, is that I can synthesize templates, write unit tests, and have a reasonable amount of confidence in my changes, without actually running it against an environment.
Don't get me wrong CDK/Cloudformation still fails plenty at deployment time, but it's better than nothing.
I’m super curious about Pulumi -- I just inherited a mess of terragrunt at a new gig and was planning to rewrite it anyway. Is it mature enough for a reasonably standard AWS/EKS setup?
As much as I like Pulumi, no one gets fired for choosing Terraform. You're going to be able to find the most help, discussion, and stuff for Terraform. Terraform and its ecosystem are the safer choice, especially when the CDK is an option.
On the other hand I can say that a bunch of Pulumi's stuff is built on terraform underneath the covers so I'm not sure how far behind they are but it probably isn't by much. In my limited use (for example deploying SES stuff) I found it a pleasure to use and didn't find anything that terraform did that it didn't (again, likely due to pulumi being able to utilize terraform providers under the covers).
> I just inherited a mess of terragrunt at a new gig and was planning to rewrite it anyway
Hope you're really sure this is where you want to use your effort tokens -- I'm not sure how much of a mess it is, but these tools are so new that it might be worth seeing if you can de-clutter it without a complete rewrite. Or maybe it's small enough that a complete rewrite is relatively low-friction... Either way, there be dragons.
After many years working with these tools I agree. Declarative and template DSLs are OK for basic things but you quickly bump into their limitations when you want to do advanced configurations. At this point administrators should know how to code so being able to use a normal programming language with IaC modules should be the standard.
Amazon's CDK for AWS is a step in the right direction. Microsoft's new Bicep for Azure misses the mark in my view because it is yet another DSL and not a real programming language.
It seems like Nix and NixOps are designed for this type of thing, since they're built on a functional language instead of yaml. But I haven't played with them too much yet -- anyone have experience using them?
What I don’t get about “modern” development is lack of complexity management.
It feels like 8 degrees of work to write HCL that any component developer can do.
What maddened me was the lack of support for the count parameter for modules. Made me rage. But not enough to switch to tools like this or troposphere.
Not to mention the supply chain implications and security risk that comes with it.
I don't know about the author's issues, but I wrote something like this because early versions of HCL were very limited. E.g., you couldn't write a module that cranks out autoscaled DynamoDB tables per region because you could only pass scalars to modules, not entire key schemata or tag sets.
I really wish Terraform itself were written in Python or something similarly hackable without rebuilding the binary for everyone. I've glanced at CDK but it looks like building blocks for reinventing our own Terraform.
Looks like you get a runtime explosion and then some. Your CI will need all of the runtimes used, plus the extra code to handle multiple language runtimes in Pulumi via the Automation API.
With Cue, there is the possibility to import Terraform definitions into Cue, write your IaC, and output JSON. Like the parent says, Cue feels like the right way to approach this IaC problem, with a purpose-built language: it was designed to manage exactly this kind of complexity.
Does anyone know if there's OSS projects we could use to replace Terraform with Python + boto3? I specifically do not want to use Pulumi. The dependency validation wouldn't be too hard to implement, but all the inter-resource and feature-specific logic would need to be duplicated.
The reason I won't use Pulumi is 1) licensing and 2) corporate ownership. I'd rather use CloudFormation.
I feel like a lot of people are missing the fact that Terraform's value is in being a generic way to interact with a vast number of APIs.
But then when I look at things like Pulumi, I'm reminded that software engineers have their hammer and tend to see all problems as nails.
That's not a knock on the profession, seeing as I am one, but writing Terraform serves business needs that writing and deploying software does not. It takes a certain amount of maturity across the industry to not tunnel vision and write code to solve every problem.
Terraform also knows how to orchestrate those API calls in dependency order, update in place when possible or else recreate (reusing the cloud platform’s unique ID from creating each resource), and check for drift (because sometimes people alter the actual cloud resource but not the template file). The problem is that some cloud APIs require a lot of boilerplate params to set up correctly (looking at you, aws_appautoscaling_policy), and HCL’s unit of reuse makes it fairly hard to factor out what our resources have in common and end up with template files clear enough for a reviewer to catch mistakes.
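The dependency-order point can be sketched in a few lines -- this is not how Terraform implements it, just the shape of the problem (resource names are made up): resources declare what they reference, and the tool derives a safe creation order from the graph.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on; Terraform
# builds a comparable graph from inter-resource references.
deps = {
    "aws_instance.web":       {"aws_subnet.main", "aws_security_group.web"},
    "aws_subnet.main":        {"aws_vpc.main"},
    "aws_security_group.web": {"aws_vpc.main"},
    "aws_vpc.main":           set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order[0])   # aws_vpc.main -- created first
print(order[-1])  # aws_instance.web -- created last
```

Destroys would walk the same graph in reverse, which is part of what makes "just creating resources if they don't exist" an underestimate of the problem.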
Yeah, not everything available in the resource is supported in the cloud API (hello, Elastic Beanstalk), but Terraform is still infinitely better to use and work with directly.
You will end up recreating a lot of what Terraform is doing. It is not a matter of just creating resources if they don't exist. Notably, modifying existing resources in the right order is not trivial.