If you write your config in a "full-blown" programming language, then your configs are full-blown programs in that programming language. This situation just plain sucks, or at least it comes very close to (or passes) the "suck" threshold every time I experience it. Code-as-configuration demands tremendous discipline from the team.
Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete. You can embed limited shell commands in the DSL as an escape hatch, but make it difficult enough that you really have to try to make a mess.
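To illustrate, here's a hypothetical shape such a restricted YAML DSL might take - the verbs and field names here are invented, but the point is that the whitelisted operations are easy and the shell escape hatch is deliberately explicit and noisy:

```yaml
# Hypothetical restricted DSL: only whitelisted verbs are accepted by
# the config engine, and the escape hatch announces itself loudly.
steps:
  - copy: { src: app.tar.gz, dest: /srv/app }
  - template: { src: app.conf.in, dest: /etc/app.conf }
  - unsafe_shell:
      command: "systemctl restart app"
      reason: "no built-in verb for service restart yet"
```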
Another example of this DSL model done mostly-right is Make.
Once you accept that idea, whether to use JSON vs YAML vs TOML vs XML vs S-expressions is just bikeshedding over syntax.
As for "why YAML in 2021" specifically? Yes, YAML is a big spec and there are a lot of ways to get strings wrong. But maybe you don't care, or your team is unlikely to ever go near the darker corners of the spec. For simple config files, YAML is just really easy to read and write. And if you do need multi-line strings, it's a whole lot easier than doing it in JSON.
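For comparison, a multi-line string is a one-character affair in YAML (the key and text here are invented for illustration):

```yaml
# A multi-line string as a YAML literal block scalar:
motd: |
  Welcome to the build server.
  See the wiki before running deploys.
```

The JSON equivalent has to cram the same text into a single line with escapes: {"motd": "Welcome to the build server.\nSee the wiki before running deploys.\n"}.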
I'm personally a big fan of TOML, but maybe YAML is still better for highly-nested data.
Of course S-expressions are wonderful for many reasons, but they share the problem with JSON of being somewhat hard to diff and edit without support from tooling.
>If you write your config in a "full-blown" programming language, then your configs are full-blown programs in that programming language.
Yes, which means we have well-documented functionality and tooling of that language to deal with various use cases. Which is not going to be the case with your ad-hoc format based on YAML or JSON.
>Code-as-configuration demands tremendous discipline from the team.
No more discipline than any other form of programming.
>Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control
And here is the crux of the issue. Tools that are designed so that someone can keep "more control" rather than for tool users to solve real problems. The industry is sliding back towards bad old days of batch processing because of conceit and lack of lateral thinking.
Organizations don't have access to an infinite pool of highly disciplined software engineers. The less discipline or skill required to get something done quickly and safely, the more things they can get done, with more people and more kinds of people, divided into teams with different responsibilities and different kinds of code.
This is an important point. There was even a discussion here a few months back (I remembered it being more recent, but it was 4 months ago) on an article titled "Discipline Doesn't Scale" [0]. Discipline works up to a point, but the more your system relies on discipline, the more fragile it becomes as you scale (in people, in size of the system). At some point you'll hit a wall where your system is too big or you have too many people and discipline falters as a consequence, or you get slowed down maintaining discipline beyond what's reasonable for your field and customers.
I consider myself a highly disciplined software engineer, and I still want as many guard rails for myself as possible. I am a human, I make mistakes; my schema validator does not make mistakes.
The pool of people who understand any popular scripting language is incomparably larger than the pool of people who understand your clever dialect of YAML, JSON or XML.
With a schema, you have a fully-documented and soundly, statically typed DSL. If it were a Ruby library, you'd have to read the docs anyway, and you also lose static parsing and validation.
The best of both worlds is to use a de facto standardized non-executable format like INI or JSON that nearly every language supports.
Then, if you need to, you can create complex or overly long configuration files in Python by inserting keys into a dictionary and dumping to ConfigParser (or however your favourite language does things). For example, it's useful when writing a test for many permutations of something similar.
Meanwhile the parsing side is simple enough to be re-implemented in an hour when the time comes to rewrite your whole stack in C+Verilog for real ultimate performance.
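A minimal sketch of that workflow in Python, using only the standard library (the section and key names here are made up):

```python
import configparser
import io

# Build many similar sections programmatically, then dump to INI.
cfg = configparser.ConfigParser()
for i, port in enumerate([8001, 8002, 8003]):
    cfg[f"worker{i}"] = {"port": str(port), "threads": "4"}

buf = io.StringIO()
cfg.write(buf)
ini_text = buf.getvalue()

# The INI side stays trivially parseable by any language.
roundtrip = configparser.ConfigParser()
roundtrip.read_string(ini_text)
print(roundtrip["worker1"]["port"])  # → 8002
```

The generator can be as clever as you like; the artifact it emits stays dumb.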
The 2 main things are:
1) Using your own bespoke config format or some pet format that's not widely supported adds needless friction to writing little duct tape scripts, testing harnesses, and misc tools. It also adds unnecessary difficulty when porting parts of your program to new languages.
2) Using a Turing complete config format even if it's not bespoke makes all the drawbacks in (1) even more apparent.
> Yes, which means we have well-documented functionality and tooling of that language to deal with various use cases. Which is not going to be the case with your ad-hoc format based on YAML or JSON.
Really? Unless it's written in Haskell or something else with a very strong type system, you won't do better than JSONSchema for validating the config file.
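For instance, a small JSON Schema for a hypothetical service config can already express required keys, types, and bounds, with no executable config language involved:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "replicas"],
  "properties": {
    "name": { "type": "string" },
    "replicas": { "type": "integer", "minimum": 1 }
  },
  "additionalProperties": false
}
```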
> And here is the crux of the issue. Tools that are designed so that someone can keep "more control" rather than for tool users to solve real problems. The industry is sliding back towards bad old days of batch processing because of conceit and lack of lateral thinking.
Too much freedom is a bad thing. The industry is not "sliding" anywhere. We tried code-as-configuration, it required too much discipline, so the pendulum is swinging back. As pointed out elsewhere, hopefully Dhall will save us from all this by being the happy balance between expressive and chaos-limiting.
I almost agree with you, but then again I recall the use of YAML for Ansible configuration, and the pain that bolting on additional things has caused.
It has to be said there are a lot of things that are almost fully scriptable, for example the "mutt" mail client. It has a configuration language, but it isn't real in the sense that you can't define functions, use loops, etc. I eventually wrote my own mail client so I could do complicated things with a real configuration language (Lua in my case).
Seeing scripting languages grow up in an ad-hoc fashion often leaves you in the worst of all worlds. Once upon a time I decided I wanted to script the generation of GNU screen configuration files, for example. I made a trivial patch:
* If the .screenrc file is non-executable - read/parse.
* Otherwise execute it, and parse the result.
Been a few years now, but I think the end result was that I wrote a configuration generator in Perl that did the necessary things. (Of course this was before I submitted the "unbindall" primitive upstream, which was one small change that made custom use of screen safer - using it as a login shell, for customers who shouldn't be able to run arbitrary things.)
> Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete. You can embed limited shell commands in the DSL as an escape hatch, but make it difficult enough that you really have to try to make a mess.
The article actually mentions Dhall as a solution. This engineering problem has been resolved.
Yeah, I am really excited about Dhall. I think this is the future; it supports the types of abstractions that we need without the mess of full templating or full Turing completeness.
The one downside to Dhall is you really want to have an implementation for it in each common language. You can use it to generate YAML, but I think it would be better if tools understood Dhall and that is a bigger ask because it is a more complicated implementation.
Let's build Dhall implementations for every major language, convince Gabe to format things in a way that makes it look more familiar to non-haskell people and consider this problem solved.
I love Dhall and really don't understand why the industry hasn't standardised on it yet, seems like a no brainer.
I disagree with your point that it should be supported by each language, however; I think it's much better to use something simple like JSON as a "compilation" target, since it's easy for machines to read and lets users pick the configuration backend.
Use a smart language like Dhall or bazel for managing configuration and use a mundane format like JSON for the machine, let the Dhall binary bridge the gap.
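A sketch of what that looks like (the worker config is invented, and this assumes the standard dhall-to-json tool on the build side):

```dhall
-- Factor out repetition with a function, then compile to inert JSON
-- via `dhall-to-json`; the program consuming the config never sees Dhall.
let makeWorker = \(port : Natural) -> { port = port, threads = 4 }

in  { workers = [ makeWorker 8001, makeWorker 8002, makeWorker 8003 ] }
```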
The trouble I have with DSL's is I don't work on the scripts often enough to become proficient with them. If I haven't looked at it for six months then I'm going to spend most of my time googling or reading half baked documentation.
The “bikeshedding over syntax” issue misses an important point from the post:
> [I]n many ways, XML and XSLT are better than an ad-hoc YAML based scripting language. XSLT is a documented and standardized thing, not just some ad-hoc format for specifying execution.
Standardization and reliable documentation really is an important risk mitigation compared to a “widespread” convention in YAML that might disappear (and even become confusing to new developers) if some new YAML-based API becomes more popular. In many cases this stability will not be worth the annoyances of XML, but it’s not a trivial concern.
> you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete
Turing completeness is a red herring when it comes to config languages, IMHO. Purity is a much more important consideration, e.g. ensuring it can't delete files or vary its output based on random network calls.
Besides, there are no truly Turing complete languages, as we are dealing with finite computers.
So my preference is a simple language with an explicit limit on the number of operations and the amount of memory its interpreter can use before aborting, rather than a complex config without any explicit limits on complexity, leading to exploits via stack or memory overflow in pathological cases.
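A toy sketch of that idea in Python: a tiny expression evaluator with a "fuel" counter that aborts once the operation budget is spent (the expression format here is invented for illustration):

```python
# Bounded evaluation: every step burns one unit of fuel, so pathological
# configs cannot spin forever or blow the stack.
def evaluate(expr, fuel=1000):
    def step(e):
        nonlocal fuel
        fuel -= 1
        if fuel < 0:
            raise RuntimeError("operation limit exceeded")
        if isinstance(e, (int, str)):
            return e  # literals evaluate to themselves
        op, *args = e  # e.g. ("+", 1, ("+", 2, 3))
        vals = [step(a) for a in args]
        if op == "+":
            return sum(vals)
        raise ValueError(f"unknown op {op!r}")
    return step(expr)

print(evaluate(("+", 1, ("+", 2, 3))))  # → 6
```

With fuel=1 the same call raises instead of computing, which is exactly the "abort with an explicit limit" behaviour rather than an open-ended crash.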
> Besides, there are no truly Turing complete languages, as we are dealing with finite computers.
This is not true. Turing complete languages are so because their halting problem is undecidable; it is irrelevant that the computer you run a Python program on has finite memory. Check out languages like Agda, where you can do general-purpose computing but are not Turing complete, since all programs can be proved to halt.
> I'm personally a big fan of TOML, but maybe YAML is still better for highly-nested data.
That is also my experience. TOML is really cool for simple key/value stores, but keeping everything linear in the config makes nesting error-prone, e.g. with [[table.subtable.list]] to append an item to table.subtable.list. It's really easy to miss a nesting level by accident.
Also related: newcomers to TOML find it really confusing that appending a single line to the configuration file will append it to the latest defined table, not as a top-level key.
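A small example of both pitfalls (keys invented for illustration):

```toml
[server]
host = "0.0.0.0"

# Each [[server.routes]] header appends one element to server.routes;
# it's easy to drop a level and write [[routes]] by accident.
[[server.routes]]
path = "/api"

[[server.routes]]
path = "/health"

# Pitfall: this bare key belongs to the last table defined above
# (the second server.routes entry), NOT to the top level.
timeout = 30
```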
> Another example of this DSL model done mostly-right is Make.
I think you just internalized the pain of Make. I used to be good at it, didn't program C for 20 years, and came back to it for a few projects and wanted to tear my hair out.
The PLs I'm currently working with have declarative build DSLs in the same language (mix.exs for Elixir and build.zig for Zig), and this is fantastic.
So it should be for configs. Use a truly Turing-complete language if you need control flow. I think HashiCorp got this right, but by then everyone hated having to learn Ruby.
I think this is the real reason why YAML configs got popular. If you had a DSL in language X, programmers would get defensive that it was in Blub and not their PL of choice. YAML was a way of being language-agnostic neutral ground.
Regarding not wanting to learn Ruby, I wonder how much of the inertia is installation. I mean, some people seem to have visceral reactions to the syntax (I've even seen people say they dislike Elixir because it's like Ruby). But the lesson I took away from using Ruby DSLs is that users don't want to deal with figuring out how to safely install a new version without borking the system version, segregate workspaces, install packages, etc. Python suffers from that too, but for some reason we all ignore it, maybe because a lot of people consider it a newbie or "easy" language and complaining about it would make them seem like "not a real programmer".
Oh, 100%, specifically re: HashiCorp using Ruby. There was definitely a time between 1.8 and 2.x when installing Ruby was a nightmare. That's when I quit using Ruby, even though I loved it! And when I saw HashiCorp products using Ruby as their DSL, a part of me was worried it was not a good choice for those reasons.
Python ecosystem definitely suffers from this. I tried to do some machine learning experiments and basically all of the repos I wanted to use were on 2.x and after 30 minutes of faffing around I gave up and moved onto other packages. However, the biggest pain points for Python came in the 2-3 transition (and TensorFlow x->y in general). By then Python had too much momentum and popularity (and every undergrad learns python). TensorFlow, well at least there is a competitor (torch) and so we see that TF's popularity has basically been sucked dry, and I have no doubt that a large portion of it is just how awful Google+Nvidia have been in managing the TF releases.
>Of course S-expressions are wonderful for many reasons, but they share the problem with JSON of being somewhat hard to diff and edit without support from tooling.
If you want tooling support for trees, XPath has been there for 25 years now.
I probably need to write a blog post about this, but "full-blown programming languages" have 2 features that config files generally don't. And people often conflate them:
1. arbitrary I/O -- can I read a file from disk, open a socket, make a DNS query, etc.
2. arbitrary computation -- can I do arithmetic, can I capitalize strings, can I write a (pure) Lisp interpreter, etc.
I claim that the first IS a problem but the second ISN'T.
Arbitrary I/O is a problem because it means the configuration isn't reproducible / deterministic, so it's not debuggable. Your deployed system could be in a state that depends on the developer's laptop, and then nobody else can debug it.
The second is NOT a problem. As long as the state of the deployed system is a FUNCTION of what you have versioned/configured, then it's no problem. Functions are useful. Pure functions can also be expressed in an imperative style (another design issue that's commonly confused).
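A minimal Python sketch of point 2 without point 1: the config is a pure function of its inputs, so arbitrary computation is allowed but the result is always reproducible (the names here are illustrative):

```python
# "Computation without I/O": the config is a pure function of its
# arguments, so the same inputs always produce the same deployed state.
def make_config(env: str, replicas: int) -> dict:
    # Arbitrary computation (arithmetic, string manipulation) is fine...
    return {
        "service": f"api-{env}",
        "replicas": replicas,
        "memory_mb": 256 * replicas,
    }

# ...because there are no file reads or network calls, the output is
# reproducible and therefore debuggable by anyone, not just one laptop.
assert make_config("prod", 4) == make_config("prod", 4)
print(make_config("prod", 4)["memory_mb"])  # → 1024
```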
It is possible to have a programming language with well defined semantics, the ability to have libraries and utility functions and other nice things, while not requiring Turing completeness or the need to always expose file I/O etc. This allows reproducibility and not hitting things like the halting problem in your config file.
Purely my problem for not knowing, but I ran into an issue where I needed to escape characters in a password in a YAML file. Having said that, I really like YAML as a Ruby dev.
Preface: this is why people like to complain about YAML, but really I think it's a feature and not a bug that you can write strings in so many different ways, to serve the many different needs for entering text into config files.
This is probably not a common piece of YAML knowledge, but it's arguably better to use "folded" style for a password:
password: >-
  asdjoi'";j;oj;90\[2301@
Or use single quotes, which signals to the YAML parser not to treat any characters as special, but then you need to escape the literal SINGLE QUOTE character (') by doubling it:
password: 'asdjoi''";j;oj;90\[2301@'
This is completely valid YAML that reduces to the JSON equivalent:
{"password": "asdjoi'\";j;oj;90\\[2301@"}
And of course you can always write JSON syntax for when the text escaping gets hairy, because YAML is a superset of JSON.