
I hate Terraform with a passion but it is probably the best tool out there for managing cloud infrastructure so I use it at work with no plans to replace it.

The biggest downsides are the awful half-baked language and the awkwardness of modules and passing values throughout your config. The staticness of providers is also a serious pain: for example, you can't create a Kubernetes cluster and then add a resource to it. The workaround is to use two separate Terraform stacks, which brings a lot of pain for passing values across the boundary. Furthermore, you can no longer effectively plan any change that affects the boundary between the two stacks. "Luckily" Terraform's performance is so bad that you need to split the stacks anyway.

The biggest feature I would like to see is the ability to dump a pure representation of your evaluated configuration. This would allow reasonable diffs in CI. There are of course complications, especially if you use `data` resources but technically it is possible to do a very good job here which would make it so much easier to make changes.




I strongly agree both with respect for the half-baked-ness of the language and with the "it's probably the best out there". Ultimately, these tools should have a static/yaml-like "assembly language" that describes the state of your infrastructure without any of the DRY. There would be a diffing engine which would figure out what changes need to be applied and apply them accordingly. Users could use some vanilla programming language to generate that yaml in a DRY way; then the Terraform folks don't need to badly reinvent a programming language.
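To make that concrete, here's a minimal sketch of what I mean (all the names and the output schema are made up for illustration, nothing here is a real Terraform format): an ordinary Python loop is the DRY layer, and it emits the verbose, repetition-heavy "assembly" document that the diffing engine would consume.

```python
import json

def desired_state(envs):
    """The DRY layer: the bucket pattern is written exactly once."""
    resources = []
    for env in envs:
        resources.append({
            "type": "aws_s3_bucket",
            "properties": {
                "name": f"myapp-{env}-assets",
                "versioning": env == "prod",  # only prod gets versioning
            },
        })
    return {"resources": resources}

# The verbose output: this is what gets checked in / diffed in CI.
doc = desired_state(["dev", "staging", "prod"])
print(json.dumps(doc, indent=2))
```

The point is that the bottom half (the JSON) is boring and auditable, and the top half can be any language you already know.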

I know they also have a CDK, but I can't tell if it properly solves that problem or if it still forces us into Terraform idiosyncrasies (e.g., if I rename something in Terraform, it will try to delete the corresponding resource and recreate it, and I think that absurd behavior remains with the CDK).


100%. Terraform is half-way between a tool for generating the configuration and applying it. I think Terraform's application engine is actually quite good, but I would like to use a much better tool to generate the config. (And be able to diff that config)

You can feed JSON to Terraform; however, this falls over if you need dependencies for output values. This usually isn't an issue, because most cloud provider resources have predictable IDs, but as soon as you have one that doesn't, you are in for a lot of pain and suffering.


You may be interested in Pulumi: https://www.pulumi.com/

Basically it's Terraform but instead of declaring your resources in HCL, you declare them in a real programming language. You're still producing a declarative config that the engine then diffs, applies etc. In fact, it's compatible with existing terraform providers, so it has a surprisingly large selection of things you can use it for.

Note their docs will try to guide you towards using their hosted service which basically does nothing except host the state file, but you can use an S3 or GCS bucket instead and it works fine.

It's definitely not without its own problems, but I'd say it's overall an improvement.


Unfortunately last I checked, pulumi only offers state locking with their paid service. If you want to self-host you have to implement it yourself, which seems like a non-starter for a lot of people.


This was addressed a couple months ago in https://github.com/pulumi/pulumi/pull/2697


Wow it took 2 years for the PR to get merged.



Glad somebody mentioned Pulumi. It solved all of the major problems I had with Terraform.


Not with that licensing thanks


It looks like it's Apache 2.0 licensed? What issues do you have with that license?


It’s Apache 2, isn’t it? What’s wrong with that?



Someone should make a Clojure demo of those Java bindings, or even cljs. I hope Clojure has good type based completions these days, because it would be a fantastic language for this.


It’s pretty wild that the object-identity-via-name thing is still a problem. Can they not add a transitional name feature where an object is known by multiple aliases for a while, and then when you have finished putting through a change, you can delete the original name? Is this not very basic SQL migration practice? Like column aliases until no longer needed.


I don't even understand why the state needs to know the identifiers that the high level language uses for various resources. If the high level language has a binding "foo_bucket" for an AWS S3 bucket resource with a single property `name = "foo"`, then why should the state need to know that the high level language refers to that bucket with the name "foo_bucket"? Instead, the state should look something like this (obviously simplified):

    {
        "resources": [
            {
                "type": "aws_s3_bucket",
                "properties": {"name": "foo"}
            }
        ]
    }
Note that there is no reference to "foo_bucket".


This doesn't make sense to me. You need to know the logical identifier in order to explicitly link the code with the resource. Otherwise if I change the code for that resource how does TF know what it needs to change if none of the existing resources in state matches the new config? Do you just always destroy and re-create every time there's a change to anything?


> Otherwise if I change the code for that resource how does TF know what it needs to change if none of the existing resources in state matches the new config?

A resource provider defines a collection of fields that is the "identifier" for the resource. For example, an S3 bucket resource would have the "name" field for its identifier.

If you change another attribute besides the bucket name, the engine will see that the input and the state both have a s3 bucket resource with the same name but different props, so it knows it will need to update some props (rather than create a new one). However, if the name changes, the engine will see that the input has a bucket that doesn't exist in the state so it will add a "create bucket" step to the plan. It will also see that the state has a bucket that isn't in the input, so it will add a "delete bucket" step to the plan.

Maybe another way of saying the same thing is that a resource provider can mark any given field as "forces replacement", and all of the fields that force replacement are the de facto identifiers? I haven't thought through whether these are exactly equivalent.
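Roughly, the plan logic I'm describing would look like this (the resource schema and the idea that each type declares its identifier fields are both hypothetical simplifications, not Terraform's actual internals):

```python
# Each resource type declares which property fields form its identity.
IDENTIFIER_FIELDS = {"aws_s3_bucket": ("name",)}

def identity(res):
    """Key a resource by its type plus its identifying property values."""
    fields = IDENTIFIER_FIELDS[res["type"]]
    return (res["type"],) + tuple(res["properties"][f] for f in fields)

def plan(desired, current):
    want = {identity(r): r for r in desired}
    have = {identity(r): r for r in current}
    steps = []
    for key, res in want.items():
        if key not in have:
            steps.append(("create", res))        # new identity: create
        elif res["properties"] != have[key]["properties"]:
            steps.append(("update", res))        # same identity, changed props
    for key in have:
        if key not in want:
            steps.append(("delete", have[key]))  # identity gone: delete
    return steps
```

With this, renaming the bucket produces a create plus a delete (exactly the behavior I described), while changing any other property produces a single update.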


The "identifier" is often something that's computed later or returned from the API. Think about something like an ec2 instance - the identifier is the instance ID that's returned from AWS. You can have many instances that basically look identical so how do you differentiate which one this logical resource is referencing?

And back to the S3 bucket use case: sometimes you want uniqueness in your name, so you use a prefix instead of specifying the whole name. How do you determine which bucket that resource is referencing if there are multiple buckets matching the prefix?

I hear what you're saying in terms of wanting state management to be simplified, but pretty much every IaC solution uses this explicit logical resource -> physical resource mapping in state.


Yeah, moving objects around the config is common if you want to keep it organized, and it requires manual actions that need essentially a global lock on the stack (and Terraform has no built-in feature to actually take this lock). It makes it basically impossible to implement a fully automated production change pipeline with Terraform.


Moreover I can never, ever, remember the syntax for moving objects around the config. It's really painful.

Edit: the aliases would have to handle moving as well as renaming. You could just have aliases in a global namespace, which means adding `alias = "portable-elb"` and doing one `terraform apply` means you can pick up that config, drop it anywhere else, and it will move it for you. It wouldn't even need to do a full `apply`, just a local JSON manipulation.
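For what it's worth, the "local JSON manipulation" could be as small as rewriting the address a resource is stored under. A minimal sketch, assuming a heavily simplified (and hypothetical) state schema rather than Terraform's real on-disk format:

```python
def move_address(state, old_addr, new_addr):
    # Rewrite the address a resource is stored under, leaving the real
    # cloud object untouched; conceptually what `terraform state mv` does.
    for res in state["resources"]:
        if res["address"] == old_addr:
            res["address"] = new_addr
    return state

state = {"resources": [{"address": "aws_lb.main", "id": "elb-123"}]}
move_address(state, "aws_lb.main", "module.network.aws_lb.main")
```

No API calls involved, which is exactly why it shouldn't need a full `apply`.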


> application engine [vs] tool to generate the config

I get it from HashiCorp's perspective though.

A robust application engine with a suboptimal config generator is a viable product.

A suboptimal application engine with a brilliant config generator is not.

So given limited resources, former gets the dev grease.


This is a false dichotomy.

You can generate these configs really easily with any off-the-shelf programming language for a small fraction of the effort they’ve put into HCL + all of the stuff on top that makes HCL the shitty programming language that it is.

Even if you insist on building your own programming language for this purpose, Hashicorp could’ve saved themselves a lot of work by looking at the prior art of the last 70 years of programming language history.

In other words, if they just picked, say, JavaScript from the start they could have saved a bunch of time and energy and put that into their application engine.


> You can feed JSON to Terraform however this falls over if you need dependencies for output values

This is what I've started doing with Jsonnet for generation, and also exactly why I've stopped doing it.


I'm not sure I follow exactly what you're missing. `${aws_instance.example.x}` as a string value creates the same dependency as it would via HCL when used with JSON.
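For reference, a minimal `main.tf.json` along those lines (the resource names and the AMI are placeholder values); the `${...}` string is enough for Terraform to order the EIP after the instance:

```json
{
  "resource": {
    "aws_instance": {
      "example": {
        "ami": "ami-0abcdef1234567890",
        "instance_type": "t3.micro"
      }
    },
    "aws_eip": {
      "example": {
        "instance": "${aws_instance.example.id}"
      }
    }
  }
}
```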


Same here: I don't see how outputs are being treated any differently by Terraform than any other .tf file written in HCL. I'm not saying it's not possible, but I haven't experienced a failure mode there yet.


Thanks for the hint, now I'm not sure what went wrong when I tried something like this. I should read up on this more.


What are some of the tools that do this? The only ones I know of are Scalr and Pulumi.


> Ultimately, these tools should have a static/yaml-like "assembly language" that describes the state of your infrastructure without any of the DRY.

CloudFormation ?

> There would be a diffing engine which would figure out what changes need to be applied and apply them accordingly.

CloudFormation.


Problem with CloudFormation is that it doesn't work with Cloudflare, Azure, GCP, Big-IP, Palo Alto, NetBox etc..


It's only a problem if you use these vendors; you don't have to.


It's a pretty tough sell to tell people they have to uproot all of their existing infrastructure and move to Amazon just to use an infra-as-code tool.


It's also unlikely that you will only use AWS, forever. At some point in time you'll have to deal with various resources (be it IT resources, time, money or people-as-a-resource), and whenever you bind your knowledge and workforce to an IaC tool that doesn't transfer or isn't portable you're going to end up with N+1 tools every time. In other words: it doesn't scale all that well. (And that doesn't mean Google-scale, but going from 2 IaC engineers to 5 IaC engineers is much harder if you can't apply universal tooling)

Tools are never 'just tools', there is context and there are externalities. And as you already pointed out: migrating/uprooting all of those other things isn't a likely scenario.


Agreed. If you use an auth service (SaaS or self-hosted) that isn't AWS Cognito you will also find yourself wanting to integrate with your IaC tool. Having to roll this yourself with CloudFormation is a lot of effort, or at least it was last time I looked, and importing a third party "provider" wasn't really a thing.


Fun fact: You don't even have to use Terraform


Yeah, CloudFormation is workable in this regard (I've created a neat generator for Python), although it has lots of its own problems (e.g., if you want to create a new resource, you have to run it as its own lambda--your infra-as-code needs its own infra which needs its own infra-as-code).


> I've created a neat generator for Python

care to share? (I know some hn users often don't w/o being asked, out of a sense of not wanting to be seen as self-promoting.)


It’s hanging out in a private repo with a bunch of other stuff and I don’t care to put it in its own repo at the moment. Basically CloudFormation publishes a JSON spec of all of their resource types, and I use that to generate Python code with type annotations. It’s sort of like Troposphere, but I go further: Tropo makes you reference resources by their CloudFormation string names, but my tool lets you use the Python object containing the resource and it will resolve to the correct CloudFormation “Ref” object at compile time. (Also, unlike Tropo, I generated my Python types from a spec so I don’t have to keep up with AWS changes.) That said, I’ve given up on CloudFormation altogether since Terraform has better support for resources outside of AWS.
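The core of the Ref-resolution idea is small. Here's a hypothetical sketch of it (not my actual code, and the class/function names are invented): a resource remembers its logical name, and serialization turns any resource used as a property value into a `{"Ref": ...}` node.

```python
class Resource:
    """A hypothetical in-memory resource with a CloudFormation logical name."""
    def __init__(self, logical_name, type_, **props):
        self.logical_name = logical_name
        self.type = type_
        self.props = props

def _resolve(value):
    # Any Resource used as a property value becomes a {"Ref": ...} node.
    if isinstance(value, Resource):
        return {"Ref": value.logical_name}
    if isinstance(value, dict):
        return {k: _resolve(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_resolve(v) for v in value]
    return value

def to_template(*resources):
    return {"Resources": {
        r.logical_name: {"Type": r.type, "Properties": _resolve(r.props)}
        for r in resources
    }}

bucket = Resource("AssetsBucket", "AWS::S3::Bucket")
policy = Resource("AssetsPolicy", "AWS::S3::BucketPolicy", Bucket=bucket)
template = to_template(bucket, policy)  # "Bucket" serializes as a Ref
```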


>if you want to create a new resource, you have to run it as its own lambda

Please don't, lol


> they also have a CDK

Terraform-CDK, as of now, needs to go through standard HCL parser. Sadly, there is no backdoor into Terraform's internal structures. If HCL (as a language) is the limitation for you, the CDK does not let you fly around it.


This would be great. Perhaps it could be based on https://dhall-lang.org/


I absolutely think a statically typed language is the right way to go (from experience using a Python->CloudFormation generator even with Mypy), but Dhall is going to be really unfamiliar for most people and it's hard to sell people on new languages that are syntactically unfamiliar.

As an aside, I think functional concepts could have made their way into mainstream programming much earlier if the FP people would have been willing to lower themselves to syntax that is readable to us plebs--I think this is no small part of Rust's success. People say syntax doesn't matter, but I disagree.


https://cuelang.org has better syntax but its logic based unification is a struggle bus for many people.


I looked at Cue and I don't understand what problem it solves. It certainly doesn't (seem) to solve the problem of DRYing up verbose YAML, or at least it's missing any notion of a function.

"hey, these YAML blobs are all mostly the same, but they vary based on a couple of parameters--I should write a function that takes those parameters and outputs the right YAML object"

^ This is the #1 thing that the high-level language should concern itself with. Static typing is really nice to have and it's cool that Cue has a pretty interesting type system, but (as far as I can tell) it doesn't have functions. It almost has functions, but I don't want to have to resort to a hack for the #1 thing that I care about (functions).

Considering I prefer functions over sane syntax (although sane syntax is roughly tied with static typing), I'm inclined to prefer Dhall over Cue, but I'm still optimistic that something better will emerge. Also while we're on syntaxes that are deliberately obtuse, I'm pretty sure the Nix community has a Nickel language which is basically a statically typed version of the Nix language.

Maybe Cue has a more enlightened way of thinking about the infra-as-code problem and I'm just not getting it.


CUE's philosophy is to wrap code in data, not data in code, as learned from the major configuration systems at Google. Being a logical language, rather than telling the computer what to do, you state facts and it verifies that you are correct. It is also intentionally not Turing-complete, so that you cannot program in CUE.

CUE is gaining traction while still being young and changing. Grafana is adopting it for validating dashboards and such. Expect to see more of it in DevOps too.


When I stopped being an SRE at Google, my most immediate thought was relief that I would never, ever, have to deal with BCL/GCL again.

After 6 months outside Google, I desperately wished for BCL/GCL to be everywhere, because all other config languages were just plain broken. And more annoyingly, there's no better way to describe it than "I have seen better, just trust me".

CUE seems to be a step forward. Flabbergast looked like it might have been a contender. The latter is DEFINITELY inspired by BCL/GCL.

At some point, I will have to sit down with CUE and try to re-implement the "perfect little horror" in it (it should be impossible IFF CUE is not Turing-complete, but it actually turns out that there are edge cases of configuration where you want that Turing-completeness).


> Being a logical language, rather than telling the computer what to do, you state facts and it verifies that you are correct.

Sure--it's like advanced static typing for static configuration. But that seems like a different and lesser problem than DRYing up the configuration in the first place, and moreover if you use a statically typed programming language to DRY up your configuration then you get pretty similar guarantees to Cue. You don't get Cue's "unifying many definitions" approach, but I can't honestly discern the value proposition in that.

As for Turing incompleteness, that's a nice-to-have at best. If I had to choose between a Turing-incomplete declarative language like JSON and a Turing-complete imperative language like Lua, I'd take the latter every single time.


Nah, the syntax is superficial. Scala has offered better-than-Rust FP in a traditional syntax for over a decade, but if anything the tension between imperative and functional people is worse there.



You can reduce a little bit of the repetition in YAML with anchors.

There are tools that convert JSON/YAML into HCL.

https://learnxinyminutes.com/docs/yaml/#:~:text=yaml%20also%...


I think you misunderstand the problem I'm trying to solve, or maybe I misunderstand your response. My goal isn't to write YAML instead of HCL, my goal is to get rid of HCL and Terraform semantics altogether. If I had my way, Terraform's low level engine would operate on a verbose (i.e., "not DRY") YAML (or JSON or HCL or I don't care) description of resources which would be generated from (for example) a Python script.

The Python/Go/etc script is what humans interface with, and it is DRY. The YAML/HCL/etc is what the Terraform engine operates on and humans should very rarely need to interact with this.


Ah, so like you have some process which generates your YAML/HCL, which is your "IR/assembly" layer, not meant for regular human consumption/editing, which is fed to Terraform. But it's readable/auditable, VCS-trackable, and diff-able.

I do that a lot as well, and in fact I'm kinda leaning towards taking that approach from the get-go. Right now I start with the YAML, but then something inevitably leads me to templating it using make + jinja/gomplate, which eventually leads me to wanting to use Python scripts, and then invoke (a Python package; it's like gulp or make).

It's not code, like business logic code, but it's too verbose and repetitive for human manual editing.


Yeah, in the Kubernetes world, the official interface is the YAML/assembler and different people have come up with different approaches for generating that. Helm for a long time (and even currently) uses text templates (e.g., jinja, mustache, etc) to render that YAML which is predictably abysmal.

CloudFormation used JSON (and eventually YAML) but built on top of it language-like facilities (the ability to reference resources, call pseudo-functions, etc) all very poorly. So you get an impoverished language built on top of YAML.

Terraform decided they would do approximately the same thing, except they reinvented their own JSON/YAML alternative (HCL) and built a crappy programming language atop it (instead of atop JSON/YAML).

These all give you pretty crummy means of abstraction. With CloudFormation you get nested stacks instead of functions, and you can only pass scalars around (no objects or lists--except comma-delimited strings which can be parsed into a list of strings). You're also limited in how many nested stacks you can create and how many total parameters can be passed into any given top-level stack.

Terraform seems strictly better. You can pass objects and lists and I've never approached any parameter limits, but still, you have to create a whole directory just to define a function and refactoring existing code into a module is painful because it means renaming resources (putting them under the module) which Terraform interprets as intent to destroy and recreate the resource.

Helm is using text templates so you can even generate syntactically invalid YAML! I think they might be supporting Lua these days, but I haven't looked into it.

I think the idea was that the whole marketing push behind infra as code was "it's just YAML! Such declarative! Wow!" as though yaml magically simplifies the inherently complex task of infrastructure, so everyone started with something YAML-like--even though we absolutely should have known that we would need to abstract--and gradually built our own half-baked languages on top of them. Of course, infra as code is absolutely worthwhile, but it's the ability to define what you want and have a tool reconcile it with some current state--it's not some magical property of YAML/JSON/HCL/etc.

fin.


That's an accurate summary of the arc of progress in this area. Also explains why so many folks are now turning to operators (versioned procedural code that runs in k8s and does arbitrary things, rather than arbitrary versioned yaml artifacts applied to k8s) to do advanced stuff rather than layering on more templating duct tape.


> these tools should have a static/yaml-like "assembly language" that describes the state of your infrastructure without any of the DRY

the last five words are a bit of a double negative; i think you mean "without the repetition" but I can't tell.


"without DRY" in this case means "with repetitions" i.e. in a verbose way. GP wants to be able to generate this verbose, machine readable syntax with DRY, human readable syntax.


Yes, this. Thanks for clarifying for me, apologies to the parent for my lack of clarity.


Dang, your solution sounds so much like Kubernetes I'm not sure if you are joking or not.


Kubernetes is one conceivable incarnation, but it operates differently than other infra-as-code tools. Terraform, for example, builds a dependency graph of your resources and initializes them in order. Kubernetes doesn't care about dependencies, and it just keeps trying to create resources and things will fail until their dependencies come online.

Further, Kubernetes manifests are the verbose "assembly language" layer, so you still need something for humans that is DRYer.

We use Terraform to manage Kubernetes resources (as well as cloud provider resources) at the moment, but I think you can equally use cloud provider operators for Kubernetes and manage everything with Kubernetes--I haven't tried this yet so I can't comment. In the latter case, you would still need something to DRY up your Kubernetes manifests. Also, if you aren't running on Kubernetes and you just want infra-as-code, k8s is an expensive solution (in terms of operations).

What I was picturing was a more conventional infra-as-code diffing engine (like Terraform's) but with a more verbose interface similar to Kubernetes YAML.


> Kubernetes manifests are the verbose "assembly language" layer, so you still need something for humans that is DRYer.

It's a little more than that. Out-of-the-box manifests for primitives are certainly assembly-like, you're right--but CRDs allow you to operate at a higher level of abstraction while staying in the same syntax, which is powerful and unique to k8s (everything else, from Helm to Terraform to Ansible, distinguishes between pseudo-assembly "language that directly expresses changes to be made" and "language that humans can write abstractions in").


> "Luckily" Terraform's performance is so bad that you need to split the stacks anyways

Not sure what about Terraform's performance is so bad. It seems hard to blame a tool whose main execution path is potentially hundreds of network I/O requests to third-party APIs. Most of the "split stacks" I've seen are more for code organization and security reasons than for performance. It seems safer to know 100% that deploying infra for my app isn't going to mess with my VPC settings, and that it can be executed with a lower-privileged role.

> Furthermore you can no longer effectively plan any change that affects the boundary between the two stacks.

That's fair -- you do end up with these "foundational" modules a lot of the time. Like an 'aws-account basics' module or something that other modules expect the account to be set up with, as a base for being able to query data objects for subnets etc. Planning changes if that base changes can be difficult, but not impossible. Good versioning is critical. It feels in the same vein as apps that need to manage framework updates and things like that. (Though it can be made more difficult or easier based on how you've broken up your use of your cloud provider -- multiple accounts by business unit, or all in one.)


Our experience of building a provider: performance is fast with fast APIs, and slow with slow APIs. Haven't observed any of the core diffing, DAG, or apply scheduling to be problematic (but also haven't tried an apply at extremely high - 10^4? 10^5? - resource count)


> The biggest feature I would like to see is the ability to dump a pure representation of your evaluated configuration. This would allow reasonable diffs in CI. There are of course complications, especially if you use `data` resources but technically it is possible to do a very good job here which would make it so much easier to make changes.

The planned state, current state, and diff of them are all available as separate fields in the Terraform plan file, is that not what you're looking for?


The key word is "pure" here. These things all depend on the current state of the infrastructure. The "planned state" is close to what I want, but it can be very confusing if someone has deployed a new change since you forked off.


Yeah. I have a poor view of Terraform, since my first interaction was trying to make a few one-line changes to avoid repetition, and I couldn't figure out why it didn't work without setting up a connection to the AWS S3 bucket.


Have you tried Terragrunt [0]? It helps a lot with managing a set of related stacks. Still feels like a bandaid on a broken model, but it is what we have.

[0] https://terragrunt.gruntwork.io/

Regarding performance, last time I looked, Hashicorp's documentation implied there was no limit to the size of a Terraform stack. I think they meant theoretically in a science fiction universe where humanity had captured all of the sun's output to perform terraform plan and apply...


+1 from me on the "awful half-baked language" (HCL).

I just recently wrote an article about my experience, including issues and workarounds, when migrating from Terraform to Pulumi: https://blog.ekik.org/my-experience-migrating-my-infrastruct...

Hope it's OK that I'm sharing it here. I think it's relevant because there seems to be quite a lot of interest around Pulumi, and how one would go about moving from Terraform to Pulumi.


I'm actually thinking of going the other way. I've been using Pulumi for several months now, and I'm thinking of moving to Terraform, because it has a so much larger third-party ecosystem, including more providers, and tools that can analyze HCL, like Infracost and security scanners. When will I learn to see the bigger picture and value popularity over quality?


It's a very interesting point.

I've been part of managing rather large Terraform infrastructures (1000+ resources) for a couple of years, but I'm a Pulumi n00b with only about a month of experience.

The infrastructure I'm managing right now with Pulumi is much smaller, only around 130-140 different resources.

For me it ultimately came down to developer productivity. I'm much better at convincing Pulumi to do what I want compared to how it was with Terraform. This also makes me a much happier and less frustrated developer :).

My priorities might very well be different if I were to manage much larger infrastructures (infra cost would be more important for example).


The stack I manage with Pulumi is currently around 300 resources. (I think that count is inflated by all the secrets in AWS Secrets Manager, because each secret has two resources: the secret and the current version.) I currently manage it by myself, but I'm hoping that won't be the case for very long.

Maybe the ending of my previous comment was too cynical. But I think I've repeatedly made the mistake of valuing my productivity and happiness as a currently solo developer over what will let my company take full advantage of a big third-party ecosystem (including a large talent pool).


I don't think you're too cynical at all - I think you're exactly right! It's often much more sensible to use the "tried and true" stuff most of the time.

In my particular case I don't plan to have my company grow much at all - we're staying small. I think Pulumi is a sensible "bet" for me, because it does what I need right now really well. Sure, there's a bit of a risk, but worst case scenario I would spend a day or two to migrate what I have back to Terraform.

I would definitely not have made the call to "let's just switch everything to Pulumi" if I was still working at a larger company. As you said, a large talent pool / community is a huge deal when you have the option to hire people who can spend time learning a particular tool or language.


I work in a very large shop with lots of TF and we do not use any of the "ecosystem" other than Terragrunt. Almost all of it is experimental junk.

We use almost entirely one provider, with things like a "template" or "random" provider as well, which are really just core features they decided to split off into plugins. Even when we use SaaS that there is a provider for, we don't use the provider, because we aren't constantly changing it, or managing it doesn't require lots of people across multiple teams with multiple iterations and modules.


+10 from me on the "awful half-baked language" (HCL).

Only cmake's 'language' is worse.


People mention pulumi but hashicorp are creating something similar with https://github.com/hashicorp/terraform-cdk. But all the existing terraform providers work with it afaik.


I don't know if people have even tried Pulumi before recommending it.

I've tried it, and it has buggy defaults, diff generation, etc. Each time I applied the same code, it would generate a diff based off of some internal defaults and... recreate the exact same infrastructure by _tearing it down_ and making it fresh. Not ideal.

Would advise using the TF CDK specifically.


The token system is broken in TF CDK still and it's not ready for adoption. I've built two stacks with it but I'm back at terraform for now. I intend to explore pulumi though when the opportunity presents itself.

I think using a Turing-complete language like typescript with mature tooling to define cloud infrastructure feels very natural and makes things much more manageable than using HCL.

One thing I absolutely can't do without is the state management api terraform provides with its CLI. This is absent from terraform-cdk and aws's CDK, although many of the same APIs seem to exist for pulumi.


> I think using a Turing-complete language like typescript with mature tooling to define cloud infrastructure feels very natural and makes things much more manageable than using HCL.

Fully agree. Not sure if any of the CDKs (or Pulumi) get the ergonomics right though. The ergonomics should feel like we're just generating YAML/JSON/etc, but the CDKs I've seen require inheritance, mutable state, etc.

> One thing I absolutely can't do without is the state management api terraform provides with its CLI. This is absent from terraform-cdk and aws's CDK, although many of the same APIs seem to exist for pulumi.

AWS's CDK is built on CloudFormation, so I don't think it has analogs for Terraform's state APIs. As for TF CDK, I would think you would just use Terraform's CLI state management directly? Maybe I'm confused about what you're trying to do?


@throwaway894345 You can, but that means you have to introspect the generated code to determine terraform resource ids etc. A really bad developer experience on large stacks.


> This is absent from terraform-cdk

Curious to know how that is, or what an example would be? I don't see how you would have to give up state management with CDK, which I understand to be extending TF, not supplanting it.


@polynomial - You have to use the state API on the generated terraform. This means that you need to understand the structure of the generated terraform, and are dealing with generated .json files that require introspection to determine what terraform resource ids are prior to managing their state. It is possible to do, but if you're writing code, you don't want to have to worry about the generated json.


I wouldn't recommend using cdktf either yet. Can't manage multiple stacks in a single repository, no full support for input variables, constant breaking changes. It's not production ready at all.

Stick with terraform if you need to provision non-aws resources. Otherwise, use aws-cdk.


I do multiple stacks via changing the state file based off of env:

  constructor(scope: Construct, name: string, c: StackConfig) {
    super(scope, name);

    new S3Backend(this, {
      bucket: "some-bucket-here",
      key: c.name("state-env"),
      region: "" // wherever
    });
  }
 
  // ... at the bottom of main
  new Stack(app, 'something-something-dev', { environment: "dev", name: (i) => `${i}-dev` });
  new Stack(app, 'something-something-prod', { environment: "prod", name: (i) => `${i}-prod` });
Then you can use stacks properly.


Support for multiple stacks in a single file was added to cdktf recently. I’ve been managing dozens of production stacks in a single repo for a while now and highly recommended it.


And yet if you try to pass values from one stack to another, it will fail spectacularly.


> Each time I applied the same code, it would generate a diff based off of some internal defaults and... recreate the exact same infrastructure by _tearing it down_ and making it fresh. Not ideal.

Not quite the same, but in vanilla Terraform if you simply rename a resource it will tear it down and recreate it even though the resource itself hasn't changed. Makes refactoring really painful. I think you can work around this by renaming the state as well as the resource, but this is often a lot of work (and a bit of risk) just to rename an identifier so I don't bother. I suspect the CDK doesn't solve this problem either.


  terraform state mv [old name] [new name] 
I'd much rather explicitly state when real resources are renamed than have terraform diffing my code and guessing whether I wanted to rename it or I am actually trying to recreate something. I can only imagine the headaches that would happen with a tool trying to track changes to infra as well as changes to code without explicitly tying infra state to version control somehow.

https://www.terraform.io/docs/cli/commands/state/mv.html
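As a sketch (resource names are hypothetical), the rename workflow looks like:

```shell
# Rename the block in code first:
#   resource "aws_instance" "vm1" { ... }  ->  resource "aws_instance" "webserver1" { ... }
# then move the existing state entry to the new address:
terraform state mv aws_instance.vm1 aws_instance.webserver1

# A subsequent plan should report no changes, since only the
# Terraform address changed, not the real resource.
terraform plan
```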


> I'd much rather explicitly state when real resources are renamed than have terraform diffing my code and guessing whether I wanted to rename it or I am actually trying to recreate something.

But you're not renaming real resources, you're just renaming the Terraform identifier that corresponds to them. There's no reason that changing this identifier should destroy and recreate the resource it corresponds to. If you explicitly want to destroy and recreate it, you can change an attribute that forces a recreation (typically a "name" field or whatever identifier the resource's provider cares about).


OK but how does Terraform know you are renaming a resource? It is not a daemon always running and watching everything you type. It only gets a snapshot of your code to work from when you run it, it doesn't know what your code was before, just the saved state from your last run and the real state in your cloud provider. The only way it can track the state is through the name which you have provided it, if you change that name it cannot know without inferring something. Maybe it matches up all the attributes in your code and state and infers that a rename has happened. What happens when only 95% of attributes match? What happens when multiple things match (An ec2 instance only requires 2 attributes so this is plausible)?

Example 1:

You have 2 essentially identical EC2 VMs with terraform names vm1 and vm2. You decide these are not good descriptive names so change them to webserver1 and webserver2, before running that change you also realise you only need 1 of the servers so delete webserver2 from your code. Terraform runs a plan and sees there is now only a single VM definition but 2 VMs in state. Neither of the terraform identifiers match the original resources. How does it know which one was renamed and which one to delete?

Example 2:

You use Terraform for IaC and something like Chef for configuration management so your Terraform code exclusively deals with the "hardware". A service is being migrated to a new implementation so you need to delete the old VM and bring up a new one. Both old and new implementation have the same exact hardware requirements. You make the change in your Terraform code, deleting the old resource and creating a new one with the same requirements but a different name, and run a plan. Terraform tells you there's nothing to change because its inferred that you wanted to rename.


> This experimental repository contains software which is still being developed and in the alpha testing stage. It is not ready for production use.

Not sure how much you'll want to invest in being essentially an alpha tester. That being said, if you're currently using Terraform and can wait, it's worth keeping an eye on.


Right, tfcdk and k8scdk are a thing.

Pulumi is also integrating with TF.


> for example you can't create a kubernetes cluster then add a resource to it

I have no love for HCL, but you can do this by creating a kubernetes provider with the auth token pointing at the resource output for the auth token you generated for the cluster.


Yes, however this will (typically) work if the cluster already exists (from a previous run), but typically not if you're creating the cluster, and the kubernetes provider, as part of the same run.

IIRC you'll end up with a kubernetes provider without auth (typically pointing at your local machine), which is 1) not helpful, and 2) can be actively bad.

I believe the core issue here is that providers don't have the ability to specify a `depends_on` relation: https://github.com/hashicorp/terraform/issues/2430


This works even without the depends_on property. All you need to do is have the module you use for creating the cluster expose an output that is guaranteed to be a computed property.

Then use that computed property as input variable for whatever you want to deploy into Kubernetes.

We're using this with multiple providers and it works. Of course, an actual dependency that's visible would be better.


I'd love to see an example of this actually working, because I have had the opposite experience (explicitly with the Kubernetes and Helm providers); I've had to do applies in multiple steps.


This should work (as in, it will create the cluster and only then add the k8s resource to it, in the same plan/apply).

Here the module creates an EKS cluster, but this would work for any module that creates a k8s cluster.

  module "my_cluster" {
    source                          = "terraform-aws-modules/eks/aws"
    version                         = "17.0.2"

    cluster_name                    = "my-cluster"
    cluster_version                 = "1.18"
  }

  # Queries for Kubernetes authentication
  # this data query depends on the module my_cluster
  data "aws_eks_cluster" "my_cluster" { 
    name = module.my_cluster.cluster_id
  }
  
  # this data query depends on the module my_cluster
  data "aws_eks_cluster_auth" "my_cluster" { 
    name = module.my_cluster.cluster_id
  }

  # this provider depends on the data query above, which depends on the module my_cluster
  provider "kubernetes" {  
    host                   = data.aws_eks_cluster.my_cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.my_cluster.certificate_authority.0.data)
    token                  = data.aws_eks_cluster_auth.my_cluster.token
    load_config_file       = false
  }

  # this provider depends on the data query above, which depends on the module my_cluster
  provider "helm" { 
    kubernetes {
      host                   = data.aws_eks_cluster.my_cluster.endpoint
      cluster_ca_certificate = base64decode(data.aws_eks_cluster.my_cluster.certificate_authority.0.data)
      token                  = data.aws_eks_cluster_auth.my_cluster.token
      load_config_file       = false
    }
  }


  # this resource depends on the k8s provider, which depends on the data query above, which depends on the module my_cluster
  resource "kubernetes_namespace" "namespaces" { 

    metadata {
      name = "my-namespace"
    }
  }


I literally implemented this not a month ago. I don't understand the complaint at all. Terraform is easily able to orchestrate a cluster and then use its data to configure the provider. The provider details do not need to be available until resources are created using the provider, which won't occur until the EKS cluster is available.


Using something similar, but it doesn't handle cluster deletion well.


You can do this with either:

1. depends_on = ...

2. an implicit dependency, i.e. reference some cluster property in your deployment, which causes the same behavior as depends_on
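As a sketch (the aws_eks_cluster.main resource name is hypothetical), the two options look like:

```hcl
# Option 1: explicit ordering via depends_on
resource "kubernetes_namespace" "example" {
  depends_on = [aws_eks_cluster.main]
  metadata {
    name = "example"
  }
}

# Option 2: implicit dependency; referencing a cluster attribute
# makes Terraform order creation the same way depends_on would.
resource "kubernetes_namespace" "example2" {
  metadata {
    name = "example2"
    labels = {
      cluster = aws_eks_cluster.main.name
    }
  }
}
```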


The tool is ok, but developing plugins for it shows how inadequate Golang is for the job. There's so much repetition and boilerplate required. I wrote a FreeIPA plugin a few years back, it handled just registering a host and the executable weighed over 100 MB! WTF? Haven't looked at that side of things lately, I wonder if it's different nowadays.


We have a large number of resources available in our Spacelift provider[0] and it weighs ~20 MB.

It'll probably mostly depend on the libraries you use.

[0]:https://github.com/spacelift-io/terraform-provider-spacelift...


Definitely agree with this, Go is so verbose for this application. When I wrote a provider, I had the same problem. What made it even worse is that I was connecting to an API that made use of dynamic JSON generation. So many interfaces and other hacks to get the JSON documents to parse correctly.


Is it a Go problem or a new-to-Go problem? I haven't written terraform plugins specifically but I have been writing Go for years and never find myself needing to write an excessive amount of boilerplate. There can definitely be some frustrations in dealing with dynamic JSON though. JSON-to-Go converters are your friend.


I was not using anything special, I had implemented my own client for IPA. The equivalent functionality in Python (I ended up using Ansible to do my thing) uses just a few kB ...



Why not use something like Ansible instead?

It too is declarative. It too can be easily extended. It's also something a lot of people already know.

I used to use Ansible or Puppet for these things before Terraform was all the rage. It was a lot more stable than trying to distribute those state files, which is a strange design to pick. There are plenty of existing modules but it's also dead simple to write your own.


I have limited experience with Ansible, but afaik calling it declarative when compared to Terraform is a stretch [1]

[1] https://blog.gruntwork.io/why-we-use-terraform-and-not-chef-...


It should be noted that the article is written to sell services for Terraform. It is unfortunately built on a few false premises that are never argued. Very few Chef developers would agree with Chef being somehow more imperative than Puppet, for example, seeing how the language was originally thought of as a superset of Puppet's.

The author does not specify which module is used for AWS, but it is not representative for how one would want to use Ansible for infrastructure. Writing idempotent playbooks is widely regarded as best practice in the Ansible community.

I have used Ansible for declaring node state in large production environments (not some dinky startup) and found it to be a very straightforward way to manage infrastructure.


Ansible is not really made for managing cloud resources and it shows - the modules are not production ready.


For GCP, both the Ansible modules and the Terraform modules are actually generated from https://github.com/GoogleCloudPlatform/magic-modules, so their "production readiness" is the same.

I understand that mitchellh himself personally created a bunch of cloud modules for terraform at the beginning, and those were likely of higher quality than whatever created by some internal developers assigned by Google/Microsoft, and might be slightly better than the AWS modules maintained by community.

Anyway, when it comes to Ansible versus Terraform, the real discussion is state management. With Ansible, you don't have to deal with state, but you will need to clean up cloud resources separately. With Terraform, you can use the tool to clean up cloud resources easily, but then you also have the headache of managing state. Plus, whenever you change something, there is always the nagging feeling that it will do a destroy/recreate instead of an in-place update.


I like Terraform for infrastructure, up to the point of creating the K8s cluster, then ArgoCD for keeping K8s in sync.


That's an interesting combo. What are you keeping in sync in K8s with Argo?


The operators we offer in our clusters (e.g. ECK, Prometheus, etc... the ArgoCD ApplicationSet generators make it easy to configure which features are installed on each cluster), as well as the applications developed by the development teams. Our work isn't complete yet (still working on sync for secrets and RBAC), but it's working nicely so far.


Yeah, these days I try to avoid writing any HCL and instead feed Terraform with JSON generated via jsonnet (which we were already using to generate k8s YAML). Much better templating and language features while still remaining declarative, and it helps on a team to have a single source language for such configs.


> Also the staticness of providers are a serious pain, for example you can't create a kubernetes cluster then add a resource to it.

TF def has some rough edges, but you can certainly create a cluster and add resources in a single root module (I don’t think it’s a great practice).

In this example the EKS cluster is in a module, but it can be a ref to a resource in the same module as well.

  data "aws_eks_cluster_auth" "current" {
    name = module.eks.cluster_id
  }

  provider "kubernetes" {
    load_config_file       = false
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
    token                  = data.aws_eks_cluster_auth.current.token
  }


I never used Terraform, I started with Vagrant, then CloudFormation, CDK, and now Pulumi.

I like Pulumi the most right now.

It integrates with services like Cloudflare and Auth0 and I can use TypeScript to write my code.


I’ve had many similar frustrations about terraform, and the overall lack of visibility into what’s happening drives me mad at times.

A proper repl, with the ability to actually manage a config would be a huge step forward - I spend more time trying to figure out what vars get populated and how I can get a value into another resource than anything else. It’s like I’m constantly fighting with the HCL syntax to get what I want to happen.


If you want visibility(spoiler: it's just API calls), try using `TF_LOG=DEBUG terraform <foo>`. You might also want to set `-parallelism=1` or you'll be treated to statements printing in an order you are not expecting.


Yep, the documentation is sometimes lacking, and the concept of moving variables in and out of modules is not intuitive, to say the least.


> The biggest feature I would like to see is the ability to dump a pure representation of your evaluated configuration.

Are you asking for a dump of existing state or desired state? For existing state, see `terraform state pull`. For delta between desired+existing, see `terraform plan -out`. My apologies in advance if I completely misunderstood what you were asking for.


I am asking to dump the desired state, so that I can diff the desired state of the current commit against that of the last commit. I don't want to include production at all.
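An approximation that exists today (it still consults state and data sources during planning, so it isn't the pure dump you're describing) is rendering the plan as machine-readable JSON and diffing that in CI:

```shell
# Produce a saved plan, then render it as JSON for diffing.
terraform plan -out=plan.bin
terraform show -json plan.bin > plan.json
```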


> for example you can't create a kubernetes cluster then add a resource to it

Of course you can! DM me if you want details.



I can confirm, we're using a similar approach and it mostly works.

There are still issues though, if you try to remove your cluster the k8s provider can't be configured (no module.my_cluster.cluster_id anymore) and the refresh phase of plan will fail. You can find workarounds but those I know are quite manual / ugly.


Amen! I found it excruciating that the language was always a few simple steps away from being homomorphic to JSON. I desperately needed to be able to manipulate it as data structures, not as strings. All of the ways I found to work around its limitations made me wish for something else entirely.


Have you used it since they introduced HCL2? It supports other data structures much better than it used to. Maps, lists, sets, etc. are much easier to work with.


Still a far cry from a proper programming language, which is what we need. For example, if you want to loop over some config and generate a resource for each entry, but the resources need different providers (e.g., different AWS accounts), then you just can't do it. Further, if you just want a little function, you have to build a fully fledged module. Then there are the crazy namespaces (`var`, `local`, `resource`, `module`, etc).


Yes you can... assuming your config is a map, include a key for "provider", and set it appropriately. EG in your example for multiple AWS accounts, define providers aliased as `aws.account1`, `aws.account2`, and so on. Reference those provider aliases in your map you are iterating through, and set the provider to that value.


I'm 99% sure that will fail with an error because "provider" can't be set dynamically. See this issue, for example: https://github.com/hashicorp/terraform/issues/24476


You can write Terraform using JSON if you want to: https://www.terraform.io/docs/language/syntax/json.html
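For example (values are hypothetical), a resource in a .tf.json file looks like:

```json
{
  "resource": {
    "aws_instance": {
      "example": {
        "ami": "ami-0123456789abcdef0",
        "instance_type": "t3.micro"
      }
    }
  }
}
```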

Or do you mean something deeper?


I mean being able to reliably convert HCL↔JSON.


You can use JSON for your configuration: https://www.terraform.io/docs/language/syntax/json.html !


It's a language for a reason, there's a grammar, parser, lexer, ast. What's the problem?


The module passing got a lot better in 0.12 when you could pass full modules or resources as outputs and vars.


Have you tried Pulumi? What's your opinion?


I have not.


Same (hate it, love it, use it every day). I can't believe you left out the stilted looping syntax.


I didn't figure it was worth starting...

- Lack of functions (the only real functions are modules, which are basically unusable for quick computations such as "slugify this string").

- Very primitive loops.

- Lack of temporary variables. I often end up looping over a list multiple times and storing intermediates.

- No panic or log functions.
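The "temporary variables" workaround usually ends up as a chain of named locals, one per intermediate step, e.g. (var.users is a hypothetical input):

```hcl
locals {
  # each intermediate result needs its own top-level local
  raw_names  = [for u in var.users : u.name]
  slugs      = [for n in local.raw_names : lower(replace(n, " ", "-"))]
  slug_pairs = { for s in local.slugs : s => s }
}
```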


Add a secrets provider (to avoid having secrets in state files) to that list.


How does terraform compare to ansible?


Ansible focuses on provisioning machines whereas Terraform focuses on creating Cloud infrastructure. A common combo is using Terraform to provision VMs and networking settings then using Ansible to configure those VMs.

I find few if any reasons to use Ansible over a shell script. IMHO Ansible is just a weird YAML syntax to generate a "shell" script, with some utilities to ship that script to nodes over the network. I find it super awkward, not to mention slow and inconsistent.

For deployments I much prefer using Nix and for imperative actions I just use actual shell/python.


You can totally provision using ansible too, on most cloud vendors.

The reason to use ansible over a shell script is that the ansible playbook will be idempotent. That is to say you can run/rerun the playbook from any point without having to wipe any previous work, or worry about double applying your config changes.


> is that the ansible playbook will be idempotent

This isn't really true. I think you are correct that most of the built-in operations are idempotent but you can also do this with a small library of functions in a shell/python script or whatever you prefer. Most things you want to do on provision are idempotent anyways (install this package, download this file) or are trivial to make so (create this directory).

I would take a real programming language any day for the minor cost of having to handle idempotency myself. It would take a couple of hours to reimplement idempotent primitives to replace the Ansible standard library in just about any language.
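For instance, a couple of idempotent primitives in plain shell (a sketch; the helper names are made up) are only a few lines each:

```shell
# Create a directory only if it's missing (mkdir -p is already
# idempotent, but the guard pattern generalizes to other actions).
ensure_dir() {
  [ -d "$1" ] || mkdir -p "$1"
}

# Append a line to a file only if that exact line isn't already there.
ensure_line() {
  grep -qxF "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

Running either helper twice has the same effect as running it once, which is the property the Ansible modules give you.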

In my mind the main value of Ansible is playbooks that others have made for you, but many people avoid these anyways to have full control.


I think that it's difficult to keep an idempotent shell script or programming-language implementation as clean as Ansible over a long period. I deal with a similar thing at work, and the Ansible stuff is still mostly good over the long haul, with the weird bits like calling other scripts being obvious. The Bash script provisioner we have is just a mess. It's not that an individual can't write a better Bash or Python script, but a team of mixed experience, opinions, and skillsets coming and going over 7 years definitely cannot. Our Ansible scripts are about half as old, but I don't think the shell scripts saw a significant decline after hitting an inflection point or anything; they just gradually crept away from pure ideals.

I personally find Ansible's value lies in what it makes difficult.


They're not competition. I use Terraform for infra provisioning, and Ansible for post-provisioning application setup. I also use Packer + Ansible playbooks to build my AMIs.


You can create infra with Ansible. The downside to Ansible is the Cloud Provider modules are "community" not core and some of them are buggy.


Yup, that's the best use-case. The more that cloudy / container stuff takes over the less I use Ansible tbf.


A lot of post provisioning tasks I used to do with Ansible are now handled with cloud-init.


Exactly. Unfortunately we use it in the pipeline for building the AMI at my current place, but it's not optimal.


I like Packer + Ansible for building machine images. I haven't really tried any alternative workflows but that has been great for my needs so far!


What kinds of tasks can Ansible do that Packer isn't also capable of?


Both tools can be used to create cloud resources and configure machines but fundamentally they are very different.

Ansible is a list of actions that you apply linearly. Each action might be a noop if it already exists.

Terraform is a tree of resources that are applied by order of dependency. Terraform also records the previous run and deletes resources that are no longer in the code.

Generally, Ansible is great at performing actions on a lot of hosts. A sort of multi-ssh. And Terraform is best adapted to manage cloud resources.



