No offense to the creator of YAML, but: The fact that it became one of the de-facto standards for cloud tooling is an absolutely damning statement about the state of the industry.
I get that XML is about as sexy as mainframes, and that a lot of folks here probably have PTSD from working with Java/Spring web apps, but YAML is about the worst of all worlds.
Though I think the real problem is that real-world configuration files are way too complicated for a simple/dumb/logic-less representation like a .ini/.conf file, so someone thinks to add some logic to it - which is just config-as-code. In a terrible programming language.
If you want config-as-code (and you want to!), just do it properly and use a proper programming language for it. Don't care which one, be it JavaScript, Python, Go, PDP-11 Assembly, or Rust. But please stop with these half-measure DSLs that just don't cut it.
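To make the point concrete, here's a minimal sketch of config-as-code in Python (all names and values invented for illustration): the config is an ordinary function, so conditionals and reuse come for free instead of via a templating DSL.

```python
import os

def build_config(env: str) -> dict:
    """Build the full configuration for one environment; plain code,
    so conditionals and reuse need no templating layer."""
    debug = env != "prod"
    return {
        "debug": debug,
        "database": {
            "host": "localhost" if debug else "db.internal",
            "port": 5432,
        },
    }

config = build_config(os.environ.get("ENV", "dev"))
```

The same structure in a DSL would need a conditional syntax, an environment lookup syntax, and so on, each invented from scratch.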
Most people who lambast XML probably have never used XML, or never genuinely needed it.
I designed a complex data acquisition system and after a lot of research, I settled on XML as the only viable option for complex user configuration, that is both readable and rich in content.
I then built a UI system that works with the XML and generates config docs with ease.
Sure, XML has been used in SOAP like systems, and it rightfully gets a bad rap, but that is more on the user than the technology.
A hierarchical document with values and attributes and custom tags? That is an almost DSL in itself.
> I then built a UI system that works with the XML and generates config docs with ease.
I lost you there. Not that I'm criticizing you since I've gone the same route and built a complex UI tool to manage said config (complete with XSD schema validation and config schema migration using XSLT for version upgrades).
But now I realize that modern developers don't want a GUI to manage their config. We want to store it in git, review changes and perhaps even write our own automation and templating around it.
YAML is certainly a flawed format for most of these purposes, but so is XML. It is unnecessarily verbose and it carries a lot of complexity which was designed for a highly-extensible generic document format, but not for configuration files. XSD, Namespaces, Entities, Embedded DTD, CDATA blocks... You can't just ignore all of these, and there are very few parsers out there which work on a well-defined subset of XML. And even there, the whole attribute-vs-child-element choice is a giant distraction and constant source for unnecessary bikeshedding.
YAML has serious ambiguity issues, but there are better alternatives that have great library support like TOML. We don't have to go back to the excesses of the early 2000s and use XML as a configuration format.
In XML's defense, the attribute-vs-child-element choice is always obvious and straightforward when you're using it as a markup language. It's only when you use it as object notation that it becomes wonky, because there the choice simply doesn't map to any straightforward characteristic of the problem domain. I'm not sure I want to hold XML responsible for not being good at being something it isn't.
And, on that note, I think that if I could pick a single thing that irritates me the most about YAML, it's that it isn't actually a ML.
I'm not criticizing having attributes-vs-child-elements. The problem with XML is not that it exists or that it is a particularly badly designed markup language.
The problem is that XML is ill-suited for configuration files.
Why? I can store it in version control. I can validate if the file has syntax errors or schema errors etc. etc. With a proper platform it's easy to generate or parse. It allows for hierarchical structures (looking at you, .ini). The only downside might be that you need a larger library to parse it. But I probably encounter XML somewhere along the line anyway due to interactions with a 3rd party system, so that point is moot.
You have a low bar for config languages. I'd also like one to not be overly verbose, not be clearly intended for markup instead of object storage and not have security issues until you configure the parser just right (though that's often not an issue with configs). XML fails all of these, and YAML isn't much better.
If we're limiting ourselves to just JSON, INI, XML and YAML as potential choices, I get why people cling onto one of these suboptimal choices and then fiercely defend it, but there are other options. There's libconfig, JSON5, Dhall, various interpreted languages...
Sadly these alternatives are far from mainstream and have no support in standard libraries, so I think most developers will continue to pick whichever common option's issues they can deal with.
I'd like my config files to have two words in it "do it" and for it to just know what I want.
But at the end of the day that's not possible because reality intrudes. You give up a lot for that lack of verbosity, not everyone will agree that it's a worthy tradeoff.
It's a noble goal to want a simple configuration format, but TOML is far from the simplest: a line-separated options format is simpler. The fact that TOML needs a dedicated parser indicates that it creates a bigger parsing problem than necessary.
Because people reinvent half of it, a different half every time in each project, and call it 'configuration', which somehow grants them poetic license for whatever crazy dung they come up with.
Compare ansible vs helm vs github actions vs literally anything with nontrivial config.
Note that if you asked me for real, I'd be a dhall proponent, but most normal people look at me funny the first time they hear it and then the second time they see the syntax.
I'm curious why you think that is? Especially since you are likely to want to style the configuration for display to the user in many cases, having it be a document makes a lot of sense.
That is, if the value is something you would expect to be able to show to a user, then it probably shouldn't be an attribute. If it is a value that changes how you would show it, then an attribute makes a lot more sense.
I also think a strong adherence to any preference between attribute and markup is a touch too dogmatic. The only real distinction at the language level is if you allow children, or not. If it can have children, it pretty much has to be a child and not an attribute.
Which is especially problematic for config files where you want helpful hints or a place to store the “old value” while you do something else with the app. I have an app where I change a specific config almost every time I use it. I usually just copy the entire darn node and rename it to something that doesn’t get deserialized. Closest I can get to holding the value in a comment.
Then again, Gradle has come to show why that is a terrible idea.
I think at some point 24 out of 28 Gradle projects I had access to at a certain customer had variations in either kotlin/Groovy style or the way they did or didn't use variables, how they did or didn't do loops or maps and what not.
With Maven you (or someone who knows Maven) can immediately look at a rather small, very standardized file and start making educated guesses, and so can an IDE.
With Gradle you sometimes have to run it to actually know what it will do.
I had the same experience with Maven vs SBT (scala build system, config is scala). At first it is really cool to have access to a full programming language (in particular when it is the same as the one the project is in, which means that you do not need to "switch brains" when working on the config), but quickly people start trying to be smart or cute, and it becomes a big mess. In particular in Scala, where people _love_ defining new DSLs and favor cuteness over readability. After two years working with SBT I still do not really understand some of the DSLish constructs used in there (and I tried to read the docs).
On the other side I fell in the trap of trying to overcome the limitations of purely declarative config formats by using jinja templates, which also ended up being a very bad idea and a maintenance nightmare.
For most projects, my approach is now to try to be as standard as possible compared to the particular community in the tech at hand, and resist the urge to be smart or cute (hard!). Configuration always sucks, and I now prefer to just suck it up and get done with the config part, rather than losing time reinventing the wheel, ending up with a config that still sucks _and_ no one understands.
The good thing about Maven is it is XML so everyone wants to keep it as short as possible ;-)
(More seriously: with Maven shorter and more boring is a sign that everything is correctly configured. Maven works by the convention over configuration principle so if you don't configure something it means it follows the standard. Which again means if you see someone has configured for example a folder or something that usually isn't configured it means they have put something in a non standard location.)
The JSON modules in the same languages don't support the madness and don't necessitate a safe loader.
It's possible someone could come along and write a json library that would support this, but somehow we have made it this far without it and that's a good thing
The point is that yaml and xml both have side effects in the form of require and eval that json won't, and frequently people are unaware of this
Perhaps yaml and xml have _more_ ways to inject behavior into an application, but I would still not consider JSON safe in any way. Why would JSON.parse() even exist if `require()` and `eval()` were safe to use?
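A Python analogue of that point (illustrative only - the thread is talking about JavaScript, but the contrast is the same): a real JSON parser treats its input as inert data, while eval() happily executes it.

```python
import json

# Attacker-controlled input containing something that looks like code.
payload = '{"cmd": "__import__(\'os\').getcwd()"}'

# json.loads treats the document purely as data; nothing executes.
data = json.loads(payload)
kind_after_parse = type(data["cmd"]).__name__  # still just a string

# Feeding that same attacker-controlled text to eval() runs it.
result = eval(data["cmd"])  # actually calls os.getcwd()
```

Which is exactly why dedicated parse functions exist: they draw the line between "read this data" and "run this program".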
> But now I realize that modern developers don't want a GUI to manage their config.
I think, like almost all software, it depends on personality and other factors like how frequently someone has worked with a language/tool/etc.
If someone has the traditional Unix "read all the man pages in their entirety" kind of brain, they'll probably never use a GUI.
If someone has more of a "learn by example, then refer to the docs to cover cases the examples didn't handle" brain, a GUI can be a much easier way to clearly expose all the potential functionality of a tool. It's much easier to understand (IMO) than e.g. a template configuration file with every possible option present but commented out.
I recently started a new project, and just outright didn't want to use yaml to do it. Sadly, I'm not everyone, and if I plan to release this project, I need to account for that. What's a guy to do?
I'll tell ya hwhat... I made some simple functions; one to load any yaml/toml/json from file, by just looking at the extension (and providing an override for unusual file extensions). Another to output any data as yaml/toml/json.
I defaulted to using toml for my project, but have provided ways for everyone to be happy, with zero mucking around.
In any case, the library I linked has the functionality to load any of the three configuration types, so while I prefer toml, if you grabbed my (under development) project, and hate toml, you can use yaml or json, if you wish.
No. You need to define what configuration files are valid or not, but defining a subset of XML on the basis of syntactic and structural alternatives is lame and nonstandard.
Get a library that parses actual XML, allow anything reasonable as input (disallowing, for example, external entities, DTDs, PSVI etc. to keep the file self-contained) and "flatten" CDATA sections, entities, namespace prefixes, idrefs etc.
> But now I realize that modern developers don't want a GUI to manage their config. We want to store it in git, review changes and perhaps even write our own automation and templating around it.
not all configurations are maintained by developers, or at least theoretically they shouldn't be.
I have used XML a fair amount in past years, but now avoid it.
There is a subset of XML that is a decent language for some use cases. In particular it is good for documents where you want a _markup_ language.
But I've seen it used for a lot of things where it wasn't a great fit.
And XML has too many features, which leads to implementations that are inconsistent with each other, often slow, and have security problems such as XXE.
I wish that there was a standardized, simplified XML format that would avoid many of XML's problems and meet the needs of 90% of applications where XML is a good fit.
> Most people who lambast XML probably have never used XML or never had to need XML.
This seems like an absurd thing to say. XML went through an extreme bout of popularity. Maybe you could plausibly say that people have been soured on XML by ill-conceived uses for it that don't demonstrate its strengths... but you think most people have never worked with it? Come on.
I've done my fair share of systems integrations, and the number of teams that did not know XML or had never used it in a professional setting was around 20%. Of those who did use it, a staggering number of people never understood namespaces and started testing for equality on the element name and namespace prefix string instead of the namespace declaration. When somebody claims they "know" XML, I initially treat it like a developer saying they know SQL while I review their code and see them doing joins in the application logic.
That is the problem with XML: so much logic is needed in the application just to be able to understand it. You need so much knowledge.
> Maybe you could plausibly say that people have been soured on XML by ill-conceived uses for it that don't demonstrate its strengths... but you think most people have never worked with it? Come on.
So, according to the link - The average software developer age is between 25 and 34 years.
I think we should also define - are we talking about people who have worked with XML because they made a google site settings xml file OR people who have done serious work with XML and know what they're talking about?
First type - pretty much everyone.
Second type - not very many. I'm pretty much the only person who knows anything about XML wherever I go.
If you are 25, you have probably not done anything with XML, or at least not anything important.
If you are 34 you might have, but the last time I did anything really important with XML was 2013. I have done a few other things with it since then, either because I knew XML was the best solution, or because I was doing something very niche and the company was providing an XML API.
I bet most of the 34-year-olds have not done anything meaningful with XML either, even though if you are 34 you probably had some ticket at some point that took you a week, and you thought: wow, my extensive experience with XML now gives me the right to grouse about how bad it is! If only everybody knew as much as I, the world would be a better place!
on edit: my example of google site settings file is an example of some trivial usage, not meaning that pretty much everyone has done that exact trivial usage.
I am 34 and I feel like when I started at this job we were still going through the “everything must be XML” hangover. But hey there’s always a higher mountain.
It depends on what age you are and whether you work with documents. If you started working in the industry past 2010 and didn't have to work with generating/reading (X)HTML/OOXML/ODF, then it's rather likely you've never had an experience with XML (fortunately SOAP was deprecated very quickly).
That point about arrays is such a weakness in XML that I rarely see addressed. Arrays and lists are such a common data structure in almost every programming language of the past 40 years that not having first class syntax for representing them is absurd and a huge weakness that makes XML a non-starter for me.
The qualities of sets that arrays don't have (and vice versa) are irrelevant to the point of neither being implicitly representable in XML.
You're either providing complex objects as properties or you're providing a list of complex objects. Worse, you can have a combination of both. Without a schema it is not possible to infer whether either or both is happening.
No you've misunderstood my point. This doesn't work for cases where one child is in fact a property that is a complex object.
XML claims to solve the problem of attributes vs children but then falls short at the first hurdle by not discerning between a single complex object as an attribute and an array of complex objects as children.
JSON and YAML do not have this problem as they are explicit in their representation.
YAML example:

    parent:
      child: name

vs

    parent:
      - child: name
Try converting each of these to JSON. The former gives you "parent" as an object with a property called "child"; the latter gives you "parent" as an array containing one such object.
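Sketching that conversion in Python terms - the structures below are what a YAML loader would hand back for the two snippets:

```python
import json

# What the first snippet deserializes to: "child" is an object property.
object_form = {"parent": {"child": "name"}}

# What the second deserializes to: "parent" holds a one-element array.
array_form = {"parent": [{"child": "name"}]}

print(json.dumps(object_form))  # {"parent": {"child": "name"}}
print(json.dumps(array_form))   # {"parent": [{"child": "name"}]}
```

One extra `-` character flips the whole type of the value, which is exactly the explicitness XML lacks.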
I think the verbosity is not a problem. For example if you compare
    ["string1", "string2"]
to
    <list>
      <e>string1</e>
      <e>string2</e>
    </list>
then each element has about four bytes overhead (<e> instead of " and </e> instead of ",) plus some overhead for the list itself that may be offset by putting the name of the list itself into the element.
However, the issue is that you have to write a custom parser. There is no direct mapping between your data structure and the XML file. These developer ergonomics are a big win for JSON and consequently YAML.
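A small stdlib illustration of that difference: json.loads lands directly on native lists and strings, while with XML the mapping from elements to values is a decision you make per document.

```python
import json
import xml.etree.ElementTree as ET

# JSON: the parse result *is* the data structure.
from_json = json.loads('["string1", "string2"]')

# XML: the parse result is a tree; turning it into a list is your code.
root = ET.fromstring("<list><e>string1</e><e>string2</e></list>")
from_xml = [e.text for e in root.findall("e")]
```

Both end up at `["string1", "string2"]`, but only one needed a mapping step.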
> There is no direct mapping between your data structure and the XML file.
I think that's by design, tbh.
it's only a big win for JSON (and YAML) because the default case works OK - but every time someone has a problem parsing numbers in JSON (because the value is bigger than Integer.MAX in the host language), this is the cause.
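That failure mode is easy to reproduce: a parser that maps every JSON number to an IEEE-754 double (as JavaScript's JSON.parse does) silently corrupts large integers, while Python's json module happens to keep them exact.

```python
import json

big = 12345678901234567890  # far beyond the exact range of a double

# Python's json parses integers exactly...
exact = json.loads(str(big))

# ...but coercing through a double, as a JS-style parser must, loses digits.
via_double = int(float(str(big)))
```

Same wire format, different host language, different answer - which is the "no direct mapping" problem sneaking back in.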
Yes, I understand that (and I like XML as a format and XSLT 2.0 as a language). However, from the popularity of JSON, it seems that for most cases it's the easier choice.
Take any random REST API for example. If it returns JSON, you can integrate it more easily than if it returned XML. If you need special cases like large numbers (or date-times), you handle only those.
I'm confused? Integrating XML was fairly easy back in the day. If in a dynamic language, parse it into a DOM and then use XPath to get data out. If in a static language, parse into your objects.
With JSON, you can mostly do the same. Such that I don't necessarily see this as a huge advantage of XML, mind. Having a schema does have some advantages, though.
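For reference, the dynamic-language route really is only a few lines with the stdlib (the document here is invented, and note ElementTree supports only a limited subset of XPath):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<config><db><host>db.internal</host><port>5432</port></db></config>"
)

# Pull values out with (limited) XPath rather than hand-walking the tree.
host = doc.find("./db/host").text
port = int(doc.find("./db/port").text)
```

The type conversion (`int(...)`) is still on you, which is where JSON's native types save a step.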
JSON maps directly only to JavaScript, and only because it was designed as a subset of JavaScript; for other languages you have to use a DOM or serializers, and then there's no difference between the formats. For that matter, XML has generic serializers that can be used instead of writing a custom one every time.
If you interpret the start and end tags of the child elements as syntax indicating the type of each value, then those tags are analogous to, say, the quotes that enclose a string literal. In other words, in
    <foo>hello</foo>
    <foo>world</foo>
the <foo> and </foo> serve the same purpose as the double quotes in
    "hello",
    "world"
with the added benefit that the type system can be much richer (i.e. not everything is just a nondescript string value).
And you don’t even need a comma to separate the values! ;)
The main reason I avoid any typeless language is dates... how to represent a date/time including a time zone has been badly reinvented so many times. A string type is never the way to go there, in my opinion.
One of the classic lessons of the Falsehoods Programmers Believe about Time is that in general you can't correctly do better than simply storing the user's input (and the instant and place they entered it from) verbatim, unless you know something more about what they were entering. It's usually fine to store times in the past as a timestamp since the epoch plus a location, but the meaning of "2025-01-28 15:00 in Europe/London, for the purpose of a meeting that's being hosted there but is accessible by video call" is much more subject to change when e.g. countries change time zone. It's also not necessarily the same as "the absolute point in time 2025-01-28 15:00 assuming London's time zones stay as predicted since I entered this on 2023-09-21" or "2025-01-28 15:00 in Europe/London, for the purposes of a meeting that's being hosted in Lisbon but which I'm accessing by video call from London" (because then the Lisbon local time is the source of truth, not the London one, if Lisbon changes time zone).
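In code terms (Python's stdlib zoneinfo; the values are illustrative): store the wall-clock time plus the named zone the user gave you, and resolve to an absolute instant only at the moment you need one, under whatever rules apply then.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Store what the user entered, verbatim: local time plus a named zone.
stored = {"local": "2025-01-28T15:00", "zone": "Europe/London"}

# Resolve lazily, so a later change to the zone's rules is picked up
# automatically the next time this runs.
resolved = datetime.fromisoformat(stored["local"]).replace(
    tzinfo=ZoneInfo(stored["zone"])
)
offset_hours = resolved.utcoffset().total_seconds() / 3600
```

If London were to change its time zone rules before 2025, the stored record stays correct and only the resolution changes - precompute a UTC timestamp and you've baked in today's prediction.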
The problem is that xml just isn’t particularly human readable. I don’t think there’s more to it than that. The brackets just make it overly verbose and difficult to read at a glance.
> I don't feel the same ease with JSON or YAML though.
I imagine the same is true for people that don’t like XML. There’s just many more people that find it easy to read JSON, even though there are a few that find it easy to read XML.
You're not alone: it's verbose, requiring " around anything that is a string, and commas to separate array elements. It's a technical format without the technical foundation. It's the civil engineering equivalent of building the Golden Gate Bridge out of wooden beams because that is what you have, not what you need.
The game Rimworld uses XML for describing all of its game objects and it makes modding a fantastically wonderful system as you can modify/replace parts of the object using XPath. The end result is mods rarely conflict as they are able to target the specific mutations they want quite easily
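The mechanism can be sketched with the stdlib (the def below is made up, and ElementTree supports only a subset of XPath, but the patch-one-field idea is the same):

```python
import xml.etree.ElementTree as ET

# A made-up data-driven game def:
defs = ET.fromstring(
    "<Defs><ThingDef><defName>Rifle</defName>"
    "<damage>20</damage></ThingDef></Defs>"
)

# A "mod" patches exactly one field via an XPath-style selector,
# leaving the rest of the def (and other mods' targets) untouched.
node = defs.find(".//ThingDef[defName='Rifle']/damage")
node.text = "25"
```

Because each mod addresses a specific node rather than replacing the whole file, two mods only conflict when they target the very same field.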
> I then built a UI system that works with the XML and generates config docs with ease.
It looks like your experience is mainly with XML as an underlying format, with humans only dealing with it either at the coding level or through a tool generating the needed confs. In that kind of scenario I'd wager any coherent file format would probably work, even if the configs were encoded in brainfuck in the final step.
XML gets hated because we also had to read it and hand edit it as humans, when dealing with system configuration files (the source that will generate the rest of the configuration) and other upstream documents that are the base input for the system to read downstream when a GUI isn't available for that.
I've worked with a Symfony code base that used XML for all the routes and DI declarations, and yes, it was an utter pain to write and edit tags for such simple and repetitive configurations when even an ini file would have been good enough. And god have mercy on the guys that put CDATA sections in the middle of that just to be sure CJK chars wouldn't accidentally trip the syntax.
I have delivered XML data products comprising terabytes of information in (when I last checked) more than 800 schemas to companies around the globe, and people who have a rosy view of XML are missing how it usually works in practice. XML is extremely heavy and brings a lot of half-baked ideas, and consumers are almost never flexible in ingesting the XML. It means teams wind up supporting insanely convoluted schemas that customers will never migrate off of.
I actually think that XML has some of the best tooling in the entire industry. The problem is: It's mostly commercial tooling, which costs money. Stuff like Oxygen or XMetaL is pretty neat, and I always found Visual Studio's "Create Web Client from SOAP" pretty useful (especially if the Web Service is written in .NET, since it auto-generates a proper WSDL file).
I find XML tooling in general feature-complete and great.
Especially editor integration works great, as opposed to YAML, which is so ambiguous that IntelliJ IDEA constantly breaks indentation during copy and paste.
And better conceptual docs, because elements are supposed to be complete, deserializable objects. Having to resort to XPath as the default way of poking at XML, where JSON has a good implicit default schema, made XML feel so clunky.
At least XML has XPath ;-) The implicit JSON-to-object mapping has failed me on character encoding (no way to specify that in JSON). Stupid type errors (dates/large numbers).
I have a deja vu over YAML and XML. There was a similar discussion about this a month or so ago, IIRC.
And I haven't changed my opinion since then, having worked extensively with XML: it is a plague that was brought into the world and it needs to be killed with fire. I understand that when it was released there were no other alternatives, so "it was better than nothing", but it should have died right after JSON was invented. But no, why have a cleanly formatted file when you can have an XML one...
JSON syntax is optimized for serialization and thus unsuitable for other purposes, and for serialization loses to binary formats except for one use case: it's a serialization format that can be nicely embedded in html.
> I settled on XML as the only viable option for complex user configuration, that is both readable and rich in content.
How was it more readable or richer than JSON? There's more stuff in XML, but in my experience that stuff doesn't actually help you any. The schema validation has a lot of detail, but since it can't access your actual system you can't really validate in that detail (like, maybe you can validate that an ID is between 6 and 8 digits long, but really you just want to validate that it's an ID for something that's present in your database). The distinction between attributes and nested tags feels like it should let you express more, but in practice it usually just gives you two equally reasonable ways to write the same thing and causes more confusion. Comments and non-tag text nodes feel nice, but complicate your parsing more than they're worth.
> Sure, XML has been used in SOAP like systems, and it rightfully gets a bad rap, but that is more on the user than the technology.
If one person uses the technology wrong, it's a problem with that person, but if most people use the technology wrong, it's a problem with the technology.
> If one person uses the technology wrong, it's a problem with that person, but if most people use the technology wrong, it's a problem with the technology.
But yaml has the exact same problem. People use it where they shouldn't.
Do a mental exercise and think about what k8s with XML configuration would have looked like, and whether it would have taken off like it did if it had used XML for configuration instead of YAML.
This is just an example.
Besides Microsoft frameworks that use XML as configuration, and Java frameworks and servers using XML for configuration (Tomcat comes to mind), no one else uses it. Maybe traditional software whose programmers don't know any better.
So yeah, I think the world doesn't like XML for really good reasons.
I would have very much preferred it. It would be a lot more readable and accessible. Nowadays there is tooling, but imagine having schemas and autocomplete for all the K8s files from day 1.
So pretty much all enterprises use it. Not bad for tech not in fashion.
K8S problem really isn't YAML, it's that the thing it's trying to configure is a naturally complicated space that really wants to be typed but can't commit to any one language either.
Configuration through code requires a lot of organisational discipline. That’s easy to do if it’s just you and your own code. It’s easy if you have good social connections between the people who review each others code (and you also have code review.)
One bad apple can ruin everything though:
    from config import School, Teacher, Course, MRS

    MNH = Teacher(MRS, "Marissa", "Neve", "Harman")
    BIO = Course("Biology", MNH)
    ST_SIMONS = School([BIO])
Oh what a nice tidy config you have there! Let’s ruin it!
    courses = [BIO]
    if str(today()) < "2023-04":
        courses.remove(BIO)  # list.pop() takes an index, remove() the value
    if today() > DIVORCE:
        for c in courses:
            if c == BIO:
                c.teacher.name = ("Marissa", "Cox")
                c.teacher.title = "Ms"
    # gibberish ad nauseam
I suppose it’s possible to write nonsense code in any language including YAML, so I don’t know if this is a very good point or not. To put it diplomatically though: your cadre of YAML editors, should they be moved over to using Python, are probably the ones who need the most help writing clear code.
This is exactly what I fear will happen. If you look at the YAML some people produce and extrapolate that to one of the most dynamic languages available, it is going to get ugly.
Programming languages for config are better, and I will throw up if I ever have to see “list comprehensions” and null checks in Terraform ever again, but it requires people who can code. If you simply replace YAML with Python and that’s folks’ first time using a normal language, it won’t work. Hence I’m happy to stay with YAML mostly, in such a scenario.
Then again, the YAML of Ansible or various CI platforms basically is code. People who already successfully write that will do well enough in Python. (Not that YAML isn't used in plenty of non-code usecases.)
It is code, but without any of the aspects usually ascribed to code. Abstraction, tools like dependency injection (fancy term, but simple and highly important concept), more complex looping (not just `for each item do`; `for each item do, if item...` etc. are needed).
Unit tests?
Not to speak of tooling support. I can write a Python application with the strictest type settings and have mypy do a lot of heavy-lifting for me, before even running the app once. A bit like Rust. Check out the typestate pattern for what I am on about. It's invariants enforced in the type system, by the compiler. Impossible to misuse: your code simply won't compile. All of that is impossible to have if your types are strings only, with the odd bool and float inbetween.
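For the curious, here's a rough Python sketch of the typestate idea (names invented): each state is its own class and operations return the next state's type, so a checker like mypy rejects operations that are invalid in the current state before the program ever runs.

```python
class OpenHandle:
    def __init__(self, name: str) -> None:
        self.name = name

    def read(self) -> str:
        return f"contents of {self.name}"

    def close(self) -> "ClosedHandle":
        # The operation consumes the open state and yields the closed one.
        return ClosedHandle(self.name)


class ClosedHandle:
    def __init__(self, name: str) -> None:
        self.name = name
    # Deliberately no read(): reading a closed handle is a *type* error.


h = OpenHandle("config.toml")
text = h.read()
closed = h.close()
# closed.read() would be flagged statically by mypy, not just at runtime.
```

None of this is expressible when your whole config surface is strings with the odd bool and float in between.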
I will accept that we cannot have ops people be at least medium-grade developers, which would be needed to apply these topics (I consider myself an in-between, leaning dev). That's simply infeasible. It's two different worlds. I will not accept the premise that these things aren't objectively better though! They're practically inachievable, sadly (or at least a decade away).
So no, in that advanced sense, YAML is not code. And if you can do YAML, yes you will get Python syntactically correct with a little practice. That is only 5% of the way though. Note I'm also not advertising for Enterprise Java-level of code... but more than "YAML but in Python".
I disagree strongly: configuration should not be code. Code should be code. Configuration should be static and simple.
It is not trivial to tell what code is doing, and as soon as your configuration is code then really you're developing a new application which implements another configuration language.
Sure, if your infrastructure landscape is describable statically, that's how it should be done.
Most scenarios nowadays are _not_ like that anymore. There aren't 5 servers next door; there are 300 serverless whatevers half a globe away. Are you going to have 300 list entries, 230 of which nigh identical but 70 subtly different? Trivial in code, almost impossible to express statically.
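A toy sketch of what that looks like in code (numbers and field names invented): the 230 identical entries are a loop, the 70 variations a dict merge.

```python
# Shared defaults for every entry.
base = {"memory": "128MB", "timeout": 30, "region": "eu-west-1"}

# The 70 subtly different ones carry only their deltas.
overrides = {f"worker-{i}": {"memory": "512MB"} for i in range(70)}

# All 300 entries, derived instead of hand-maintained.
fleet = {
    name: {**base, **overrides.get(name, {})}
    for name in (f"worker-{i}" for i in range(300))
}
```

Changing the shared timeout is a one-line edit here, versus a 300-hunk diff in a static file.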
What you're describing, though, is a configuration variability which would be expressible by a static tagging regime - still simple.
If your configuration is getting more complex than that though, again, it's not really configuration anymore - it's a management application which needs to be developed and treated like that. And why it's that complicated should be re-evaluated - i.e. how come this is "configuration" and not something the application detects for itself? Why is it being surfaced to the user (operator) at all?
I think whitespace sensitivity was not the point of the exercise.
But on that topic, I feel whitespace sensitivity is a bad idea even for generic-audience configs. People learn how to use parentheses in primary school. Explicit grouping and nesting isn't black magic.
In general I am sympathetic to whitespace - I find the argument "whitespace insensitivity means the code now contains two conflicting sources of truth: whitespace which the programmer uses, and braces which are the only one that matters because that's what the computer uses" to be compelling.
But the whitespace sensitivity of YAML is particularly bad because of e.g. Helm. Templating a whitespace sensitive language with string interpolation is such a terrible idea.
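The failure is easy to demonstrate without Helm at all: interpolate a multi-line block into indented YAML and every line after the first lands at the wrong depth (which is exactly why Helm ships `indent`/`nindent` template helpers).

```python
template = "spec:\n  containers:\n    {block}\n"

# Two-line block; the second line carries only its own relative indent.
block = "- name: app\n  image: app:v1"

rendered = template.format(block=block)
print(rendered)
# spec:
#   containers:
#     - name: app
#   image: app:v1     <- escaped its parent: now a key under "spec"
lines = rendered.splitlines()
```

String interpolation knows nothing about the indentation level of the insertion point, so the result is structurally different YAML, not just ugly YAML.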
> I get that XML is about as sexy as mainframes, and that a lot of folks here probably have PTSD from working with Java/Spring web apps, but YAML is about the worst of all worlds.
I really like your sense of humor, much appreciated. Question is whether the PTSD came out of the XML or the Java usage.
Also indeed we're eager to see finally some reasonably good Assembly configs, it'd definitely be a blow in the face of Rust magicians.
I would vote for Prolog as a config language, though. If memory serves right - some 10 years ago Matt Sergeant of the Perl community gave a very good talk about this approach, as a result of his explorations of this area. Interestingly he has some definite experience with XML as well: https://www.xml.com/pub/au/22.
Tcl actually excels as a configuration format as it has a simple syntax and you can initially restrict the available commands to a safe, non-Turing subset then add them back in piecemeal as they become necessary for more powerful scriptable setups.
YAML is easy to read, everything has a parser for it, and it plays well with version control, which cannot be said of most serialization formats. Not sorry for using it in projects and will continue to do so.
Yes but it is not easy to read (or write) and version control works less well with it than YAML. And having more complex features is part of what people disliked about XML in the first place.
If that exact same configuration were JSON or XML would it be easy to sort out? It seems obvious to me that it wouldn’t because the configuration itself is complex.
Structure and schema validation are not typically something your VC cares about since it treats everything as text. So I'm not seeing why that resolves the problem.
Except you were just talking about "merge conflicts..." two valid versions of the configuration don't necessarily stay valid when they get merged together.
Because a savvy developer will validate their schema when sorting out merge conflicts across the whole configuration, instead of committing a badly merged file that will fail to start up the Kubernetes cluster in the sea of YAML files.
I think it’s because XML was just that painful to deal with. When someone offered an alternative, any alternative, that didn’t look utterly crazy, people grabbed it with both hands.
All the languages/tools that use it as a config/data format were started during that period, before people realized that all they’d seen of yaml were toy examples.
I think YAML has a particular valid use case: short, well defined, human readable config files that need to be edited by less technical users.
The idea is that there's a single config.yml file in the project root. It's mostly empty but could grow to ~100 lines if all config options are changed, which realistically will almost never happen. Most times it'll be 0-20 lines long.
In that case xml would be needlessly confusing. It's not easy for non technical users to read.
Config as code is basically the default for (Python) Django applications, but for compiled language you would need to bake in a scripting engine or something. There's always sqlite, which is almost everywhere and very fast.
Lua is used quite a bit for that purpose, as it's small and easy to embed. I'm not a Lua programmer, but bodging together a few snippets of Lua with a well-defined configuration API is much easier than trying to do that with SQLite statements.
Indeed. About 15 years ago I picked up Lua for exactly that purpose on an embedded device that needed a somewhat sophisticated configuration. Readable. Comments included. If/else for a few things when needed. Was the best configuration experience I've had before or since.
I share the same sentiment. Provide frameworks not config DSLs.
People eventually want logic inside their config objects. We can argue how much logic but at a certain point, just using a popular programming language just makes sense because it’s familiar. I’d like my coding expertise translate into this part of the ecosystem.
That's where JSON shines: it's a properly "dead" format that can be trivially opted into allowing logic with the old console.log(JSON.stringify(...)) trick, without even touching the unaffected lines (and all the git history they might be associated with). It's certainly more trivial in some environments than in others, but I suppose it's close enough to the sweet spot of "adding logic shouldn't be too easy/shouldn't be too hard", whether the project is already nodeish or not.
The only downside is that if the objects in question happen to be properly documented in typescript, you will never want to go back.
If only there had been a formalized side by side from the start between JSON, the clean serialization format, and the JSON superset for human authoring (comments and quotes anarchy) that people have been informally reinventing hundreds of times...
I know that everyone hates YAML, but in my experience it's not that bad. I'm not using it to its full power, I guess, more like safe subset, and it works fine for me.
XML has its place, but I wouldn't want to replace my yamls with XML.
Github actions is fine provided you keep the YAML as short as possible and try to call out to a normal script to do everything - i.e. use it as config.
The problem comes when you try to use it as a programming language in and of itself (which MS encourages because it's a route to vendor lock in). It's a shitty, shitty programming language with shitty debugging tools.
To your second point: yes, yes, and very much yes.
Creating complex release workflows within GHA is hell given the debugging situation and that's what we're using because 'it's there, use it, everyone's using it'. Send help
I have literally left a 300k+/yr job, twice, due to having to write and maintain too much yaml.
Templating and significant whitespace is the worst. 10 commits in a row of "error on line 1" until helm finally parses what you want.
I am a software developer with extensive distributed systems experience. I like thinking about systems, not guess-and-check on config templates that become a bespoke DSL with no proper way to debug.
Our config structure requires similar but individualized configs for on-prem and the cloud. These can deviate for the region, the data center, the logical cluster, and at the individual node level for A/B or canary reasons. And of course local dev. You need to have a way to know where the deviations are, like different regional hostnames for services you connect with, different node sizes or replica counts, and different labels, any of which may or may not be correct and will cause an incident if wrong.
Search your YAML PRs, and I bet you will find a commit along the lines of "because yaml."
So a handful of well-known pitfalls and complaints about how it’s easy to represent the same information differently. Yeah, switching to XML definitely solves that problem.
I get your point but there is exactly no technology you can use without some known pitfalls, so you pick the one whose pitfalls are the least catastrophic or bothersome to you.
* JSON doesn't have comments. I could stop right there because that's a total deal-breaker for me for anything that's supposed to be read or written by humans.
* JSON doesn't handle multiline strings.
* JSON is not especially readable (no structure enforced, braces and double quote mandatory).
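A small invented fragment showing the first two gaps at once - comments and a literal multi-line block, neither of which plain JSON can express:

```yaml
timeout: 30   # seconds; comments alone are a deal-breaker for JSON
motd: |       # literal block scalar: newlines preserved as written
  Welcome to the build box.
  Maintenance window is Sunday 02:00.
```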
The lack of trailing commas isn't the problem. The problem is the intermediary commas.
Those are used in JavaScript to give you the option to provide an expression in place of a single term. But that's not even allowed in JSON, so the commas aren't serving any purpose at all.
It's annoying though because it's so mature compared to everything similar (even YAML, which I obviously quite like); like if you make an XSD for your config files everyone gets free editing features (not just validation, completion as well, even completion of attributes)
I have no big love for JSON or YAML, but I'd gladly take worrying about missing brackets over worrying about the handful of whitespace-related gotchas in YAML.
That's more of a problem with diff tools, though, which like almost all tooling and even programming languages, make the mistake of treating code as plain text.
OK, well, if you want to invent a new VC tool that can parse arbitrary languages into a syntax tree and make a useful diff and then convince everyone to use it instead of git then maybe I’ll stop using YAML.
I’ve felt like Python for config works pretty well, but I understand there’s a whole segment of folks who get twitchy at the thought of Python for anything. But just about every system has it or can install it, and you can do as much config-as-code as you want with it.
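A minimal sketch of that approach (the file contents and the uppercase-names convention are borrowed from Django's settings modules; everything else is invented):

```python
import runpy

# A hypothetical settings.py might contain plain assignments plus logic:
#     import os
#     DEBUG = os.environ.get("ENV", "dev") != "prod"
#     WORKERS = 4 if DEBUG else 32

def load_config(path):
    # Execute the file and keep only UPPERCASE names as settings,
    # the convention Django uses for its settings modules.
    namespace = runpy.run_path(path)
    return {k: v for k, v in namespace.items() if k.isupper()}
```

No parser to write, comments and conditionals come for free, and the result is an ordinary dict.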
> If you want config-as-code (and you want to!), just do it properly and use a proper programming language for it. Don't care which one, be it JavaScript, Python, Go, PDP-11 Assembly, or Rust. But please stop with these half-measure DSLs that just don't cut it.
I wish more tooling used Nix, it's a great lisp that isn't ugly.
Nix, the language? It's very good for derivations (big, composed k:v maps), but it doesn't feel like a lisp, nor would I want to try using it for docker-compose or a config for pre-commit or an omega one description of an ML model parameterisation.
Then again I'd love to be shown that this partially-considered opinion is wrong.
(Also, isn't Nix's inadequacy as a lisp one of the reasons for the existence of Guix?)
I actually use Nix to generate YAML files for GitHub Actions in one of my repositories. It allows me to share code between multiple actions and be consistent with versions/formatting.
Wow, this is truly terrifying. It reads as if Donald Trump was the maintainer of a popular library. "Low quality tooling! Sad!!!".
He claims with a straight face that every software using his library only loads YAMLs from sources 100% trusted to execute code. He is given example after example to the contrary, which he ignores instead opting to constantly blame "low quality tooling" for generating "false reports" about his perfect software. People can be really weird sometimes.
A lot of people misunderstand what he's actually saying.
There are two categories of constructors. One is for data that should not be executed, the other is for trusted data that should be executed.
There are two libraries. One has default constructors that can execute data, the other has default constructors that don't execute data.
He's saying to rtfm and choose the library with the correct defaults, choose the correct constructor from that library, and stop trying to take away the choice.
> He's saying to rtfm and choose the library with the correct defaults, choose the correct constructor from that library, and stop trying to take away the choice.
Nobody was trying to "take away the choice".
The problem is that you have to explicitly opt-in to be safe. If you followed the code snippets from the README, your application would be vulnerable to RCE without you realizing it; as people pointed out, it would be more secure to have Constructor (safe by default) + DangerousConstructor rather than Constructor (unsafe by default) + SafeConstructor.
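For context, the canonical demonstration of why unsafe-by-default matters, shown here in PyYAML's tag syntax (SnakeYAML's known gadgets work analogously, by naming Java classes in tags): a file that looks like data but runs code when loaded with an unsafe constructor.

```yaml
# Loaded with an unsafe loader (e.g. PyYAML's pre-5.1 yaml.load default),
# this instantiates the tagged object, i.e. runs a shell command:
exploit: !!python/object/apply:os.system ["echo pwned"]
```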
His argument was that "100% of applications using SnakeYaml do not accept untrusted data".
I understood him to be speaking tautologically, that you trust the data when you choose the trusting library without using the safe constructor, whether or not you realize the implication. He seems very well informed that some people are using the trusting constructors on untrusted data.
As he explained, this library is, by design, convenient by default. Those seeking safe by default should consider using the other library.
"take away the choice" is my summary of several comments that would have the feature removed. One was about how its existence is a vulnerability if file access is compromised. Another was about how code execution is not in the spec. And so on.
> I understood him to be speaking tautologically, that you trust the data when you choose the trusting library without using the safe constructor, whether or not you realize the implication.
He was speaking literally; he even rejected several of the provided examples because "users have to login first, therefore the data is trusted" which isn't an argument that any security-conscious person would make.
> As he explained, this library is, by design, convenient by default. Those seeking safe by default should consider using the other library.
That is a negligent mindset to have. Log4j added the ability to make arbitrary DNS and LDAP calls for the sake of convenience, which resulted in one of the most consequential vulnerabilities of the past decade.
Opt-in security is dangerous and should never be the default — especially when the feature in question is executing arbitrary input.
He also said to sanitize any data that you intend to use with the unsafe constructors. Taken together, he's pointing out that you decide how much you trust the data and you control which constructor to use. "Problem in chair"
"Should" statements are always relative to what you value. Clearly he thinks this trade-off is fine for him. His other library accommodates your security needs but this one accommodates his convenience needs. Can the man not make something for himself?
I assume it would be costly for him to make and propagate the changes. Maybe money could persuade him.
> He also said to sanitize any data that you intend to use with the unsafe constructors. Taken together, he's pointing out that you decide how much you trust the data and you control which constructor to use. "Problem in chair"
That doesn't change the fact that it's a poorly designed API that's insecure by default. There are countless situations where people are inadvertently exposed to risk via transitive dependencies, at no fault of their own.
> Can the man not make something for himself?
He did not make it for himself, he made it to be consumed by others. SnakeYAML is a widely used package.
He said he designed it for his use case of executing trusted configuration code, which some others appreciate. Obviously there was some misunderstanding about the goals and priorities of the project.
Making this change would cost him something he values without giving him something else he values in return. According to him, SnakeYAML Engine already provides a safe-by-default solution, so he's not leaving anyone without a remedy. It would cost you to switch, but you would get something in return. That seems fair to me.
why isn't the default secure? if the default isn't secure we have learned time and time again that people will use the default unknowingly exposing themselves to security holes.
Here's just a couple examples off the top of my head:
- `$variables` in bash are subject to arbitrary code execution via word splitting without escaping
- PHP register_globals
- PHP, express, and some others parse `?a[b]="foo"` in a query string as an object, allowing for prototype pollution or other exploits
- string concatenation for SQL + escape_string being the default for years
- perl array expansion in function calls
- XML entity inclusion on by default allowing you to read arbitrary files
- log4j executing arbitrary code inside its logs
- passing a variable to printf's first arg
- no difference between escaped and unescaped tags in php
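The SQL item in the list above is easy to demonstrate with Python's stdlib sqlite3 module: concatenation lets the input rewrite the query, while a bound parameter does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

evil = "' OR '1'='1"

# The insecure default of that era: string concatenation.
# The injected predicate makes the WHERE clause match every row.
rows_concat = conn.execute(
    "SELECT name FROM users WHERE name = '" + evil + "'"
).fetchall()

# The safe form: a bound parameter; the payload is just a string
# that matches no user, so nothing comes back.
rows_param = conn.execute(
    "SELECT name FROM users WHERE name = ?", (evil,)
).fetchall()
```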
XML does not, in a single line of code with no preknowledge of the document, deserialize into a map or array (of nested maps/arrays as needed). It cannot be mapped easily into domain objects/datastructs without extensive mapping info.
Instead, you need to describe the structure of the XML, have preknowledge of prefixes and meanings for namespaces, have to deal with CDATA crap, have directives, config-in-comments, and hosts of other annoyances.
XML sucks. I programmed from 1995 to the present. XML sucks. YAML is far far far far far superior.
The ONLY good thing about XML is XPath. That's it! XSLT? awful. schemas and other validations? horrid.
XStream (java library) was the only thing that made XML usable, and the second JSON (and later YAML) came out, I dropped it immediately.
That’s called “schema.” It’s a pretty out there concept, I know, the idea that you should be burdened to document the structure and intent of your data for both human and mechanical consumption. I realize I’m being forceful here, but keep in mind, you are compensated incredibly well. If it takes you another hour to save ten down the road, earn the pay. I don’t understand this aversion to tough stuff - which seems to be pretty popular here - and I’m starting to think I should interview for it a bit harder than I already do.
The problem with this thinking is that you, personally, are then forbidden from arguing for the use of a strictly typed language for development because it’s the opposite position to the one you’re holding here. The exact reasons we use languages like those are the same reasons we should be explicit with our schemas. It’s unfortunate that many people try to argue both sides due to the convenience, as you say, of a single line parse, when years of experience has taught that duck anything is a bug fountain. (Not saying you are arguing both, by the way, it’s just common.)
Try reading back your gripe with the following in mind: do I have a stronger complaint than “it’s difficult” here? I think you’ll find that you don’t convey one effectively.
I have seen the same aversion and lack of pragmatism so many times it has started to impact my motivation.
Examples:
1. When taking over a project, developers glanced at the code and decided it would be better to spend 6 months rewriting from scratch. The end result was not more readable than the original solution and introduced a new set of issues.
2. Many put too much emphasis on the worst case scenario and do not consider the average case. I worked a lot with many different XML formats and most of them were OK. Not "fun", but simply OK. I have to admit that I did struggle with some complex files, but there were plenty of times where the XML was simple, readable and easy to work with
3. When comparing programming languages they often focus on a few features and don't think about productivity in general. Languages like Java can actually be very productive, even if your favorite language can reduce null checks.
There's a happy medium to be found here - balancing ease of use, while avoiding the bug fountain. Having said that, IMHO we should err on the side of avoiding the bug fountain. Make it as simple as possible, but not simpler.
Robbing the future with deceptive over-simplicity - by creating a bunch of future difficult debugging scenarios and possibly footguns - is the far worse evil, than missing out on the maximally-convenient onboarding (which can be foolishly optimized for, for the sake of short-term popularity). All such crap-tastic solutions will eventually need to be replaced again anyway, creating an endless, hellish, slow churn.
I like rigid schemas for write and supporting both loose and rigid reads. Generally the friction in a system like that is something like the ad hoc SRE trying to load up a type stack just to interpret a protobuf log. Or Tableau. Stuff like that. That’s where people get annoyed. I think you’re right and there’s a lot of unexplored directions of simplicity.
Computers that understand the shape of your data are very helpful friends when you’re pursuing goals like data locality.
The problem is that XML doesn't even map that well even to objects and classes, even with annotation. And XSD is quite a heavyweight format with terrible UX.
I'm pretty much consistently on the strict/type-safe side of the "should we have a schema" debate, but there are better options out there to maintain a consistent schema for data interchange.
JSON is simpler to map, faster to parse, simpler, more lightweight, and less dangerous to add to an online app[1]. You can also use a schema like JSON Schema for inter-app compatibility. It has replaced XML as the standard data interchange format for a reason. It's not great for configuration files, and it's definitely being overused nowadays, but it is a solid data interchange format.
Then you've got binary formats like Protocol Buffers which are even more lightweight and faster to parse and (generally) have schemas that map better to typed languages.
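For illustration, a small hypothetical JSON Schema for the kind of server entry discussed elsewhere in this thread:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "ip":   { "type": "string", "format": "ipv4" },
    "role": { "enum": ["frontend", "backend"] }
  },
  "required": ["ip", "role"]
}
```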
I think the OP has it right: XML is very well-suited as a generic document format. I wouldn't compare it to YAML, because in a perfect world they shouldn't compete in the same categories: Nobody should use XML for configuration or YAML for documents.
And I also agree that there are better formats than YAML. I like the ease of writing indented multiline strings in YAML, but the fuzzy typing is pretty terrible. At least YAML 1.2 fixed the Norway problem.
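For anyone who hasn't hit the Norway problem: under YAML 1.1 resolution rules, a handful of country codes are booleans.

```yaml
countries:
  - GB    # a string
  - NO    # YAML 1.1 reads this as boolean false, not the string "NO"
  - SE    # a string
# YAML 1.2 only treats true/false as booleans; quoting "NO" is safe everywhere.
```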
> JSON is simpler to map, faster to parse, simpler, more lightweight
JSON is no simpler (nor more complicated) than XML, if you're using a library. It certainly isn't faster - a SAX parser is faster than a JSON DOM parser (and JSON streaming parsers equivalent to SAX are rare).
> It has replaced XML as the standard data interchange format for a reason.
the reason isn't technical. It's competency (or lack thereof). Most interchange formats are for websites in browsers, where JSON performs well, since there's no native way to parse XML in the browser. So that mindshare from the web has leaked out to other arenas.
That's crazy talk. JSON is very simple. The irreducible complexity of JSON is lower than irreducible complexity of XML, and irreducible complexity of XML is lower than YAML.
JSON is simple, but the common idiom is to read it all at once into memory and forget the difficult stuff (which encoding is this JSON in, anyway?). Then skip the other difficult stuff (is this a string or a date type?)... so it's faster (maybe) to parse, but it doesn't scale for large datasets and your application will need additional deserialization logic. A broad sweeping performance statement like this is just spreading FUD.
Regarding security: entity expansion bugs were fixed long ago. On the other hand, people still use eval() on JSON objects to parse them. So I don't get that.
JSON schema: which one? AFAIK there is no single JSON schema standard with the tooling depth and breadth of XSD.
Protobuf: nice but unreadable for humans. Might as well use CORBA or ASN.1 encoding.
- say you have an incoming data document. Say you need to programmatically read it / scan it / extract from it (think: cli and pipes).
You want to access this data for any of dozens of reasons. Graphs, logs, data points, transformations, data feeds, whatever.
I can do that task in json and yaml 10000% faster than with XML. With XML, you may have a schema (hope the document matches the putative schema!). Oh the schema is an http reference? Hope that still exists out there, the internet never breaks links. If you don't, well shit, is this tag beginning a list or a "subdocument"? Am I REALLY using the DOM api to step through nodes and attributes and CDATA? Guess I have to. There goes a day of coding.
Oh, in JSON and YAML, it's ONE LINE OF CODE to get it into something I can easily read, manipulate, analyze?
- say you have an upgrade program. it just needs to read in the old config file, rename some keys, add some new default values, etc. JSON/YAML? I can do that in stupid-simple code. XML? Well, I better hope there exists a library that loads this shit for me in my preferred language, or otherwise lots of fun with DOM. I forget, can I use regex to parse XML? (that is a joke)
- say I want to serialize an object graph pretty quickly for over the wire between languages. Do I want to write a complete XML mapping in two languages, or just do the one-line serialize, one-line deserialize? Yeah.
- say I want my config files to be somewhat extension friendly for plugins / extensions. XML parsing code? Yeah, that will be a ton of custom code. YAML/JSON deserializing to a map/dictionary? Oh, look at that, extension friendly code. Allow them to specify whatever json/yaml struct in their plugin section and pass it to the extension.
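The upgrade-program case above really is stupid-simple with stdlib json (key names invented for the sketch):

```python
import json

def upgrade(old_text):
    cfg = json.loads(old_text)
    # Hypothetical migration: rename a key, backfill a new default.
    if "hostname" in cfg:
        cfg["host"] = cfg.pop("hostname")
    cfg.setdefault("timeout_s", 30)
    return json.dumps(cfg, indent=2)
```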
This stuff happens with such frequency that I never, ever think "man I wish this was XML".
Do YAML/JSON have some issues? Do I wish XPath and some XML features had JSON equivalents? Sure ... very occasionally. Actually, never.
Where to begin...
If the schema is not there you are in the same position as when receiving some JSON or YAML: a bad one. Is 20230806 an integer or someone's idea of sending a date?
Parsing the data into a memory structure is a one-liner in any language. Assigning meaning, however...
An object graph between languages, as you describe, works only in a very small number of cases. Send your data to a mainframe and watch it disappear faster than you can send it.
Extensions are where XML shines: use an extension namespace. Hell, use a namespace per plugin. Unknown namespaces are normally ignored during deserialization if you use a schema, so no line of code needed at all.
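A sketch of the namespace-per-plugin idea (URIs and element names invented): a validator working against the core schema can ignore or lax-validate the foreign namespace.

```xml
<config xmlns="urn:example:app"
        xmlns:metrics="urn:example:plugin-metrics">
  <server ip="10.0.0.1"/>
  <!-- Plugin settings live in their own namespace; the core
       schema never needs to know about them. -->
  <metrics:settings interval="30s"/>
</config>
```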
All the issues you describe boil down, in my book, to: I don't know how to do this properly with XML, so XML is bad.
I think there's a happy middle ground that I'm shocked doesn't exist (or is so obscure I've never found it in the wild): a "default" schema that is compatible with JSON types and can be serialized the same way JSON is. Whenever this comes up, the on-disk format seems pretty irrelevant; people just want an xml.load that maps to their language primitives in a sane way.
There's this kind of stuff but it's niche and more convention than actual schema.
I think there could be, but it's ugly. The XStream library wasn't horrid, but it basically provides a way to serialize/deserialize from the java object defs, which is still a schema of sorts.
You know <key name="blah">key value</key>. It just highlights the extra verbiage, and pushes people towards "just use json/yaml".
But those are good things. Things should be mapped, otherwise you are building a house of cards that may or may not fall over in the future when someone makes changes without realizing the repercussions.
XML is great for interchange, integrations and specifications. (but config not so much).
Ehh, I don't know. Sometimes I want to put code in my config, but there's value in not. It's more portable and easier to parse with various tools and different languages if you stick to some format.
Hah, I wrote my comment before reading the top comment (yours) and I agree. Java PTSD ruined XML, but it is still - to this day - superior to everything else that came after it because of XSD / XSLT and the like.
> No offense to the creator of YAML, but: The fact that it became one of the de-facto standards for cloud tooling is an absolutely damning statement about the state of the industry.
I think the creation and uptake of YAML over something like s-expressions is an indicator that the influential practitioners in the cloud industry are all young and inexperienced.
If the creators of all these cloud tools display such poor judgment, it makes the tool itself suspect.
The Nix Language, while goofy at times, is built for config-as-code and is hiding a decent little functional language in what looks like just attribute assignments.
Likewise, CUE Lang is built for config (esp merging docs with shared refs) and is highly under-appreciated. You can express powerful computations if you puzzle over the logical inferencing for a bit.
And it always will be, cdktf is the same but on Terraform objects. At some point you have to reach the end and spit out the final list of stuff you want.
If you want full imperative just use the AWS sdk or Ansible but I think people have realized how not maintainable that pattern is.
Every time I need to do something in CDK I am reminded of https://xkcd.com/2347/ except the whole thing is my config and the tiny bit is the thing I want to understand how to change.
When programming in Groovy, using Groovy maps as config was pretty common, and allowed you to sprinkle a little bit of logic here and there if things got complicated. But it was just Groovy: no special cases, no weird escaping rules.
> The fact that it became one of the de-facto standards for cloud tooling is an absolutely damning statement about the state of the industry.
The state of the industry is... perfectly fine? I really don't get why people hate YAML.
To me the "programming language as config" idea is just like Lisp-like macro. Yeah it's powerful, but once you have more than 2 people working on it you get DSLs (s for plural).
People, at least Lisp people, like to claim the line between code and data is very blurry. In my experience in real world it's blurry 1% of the time. In most cases code is code and data is data.
All you need is a good XML editor. It could even make it look like YAML so that it is nice at read time, and even let you edit it like you would a YAML but save it as XML and enjoy the validation etc. tooling.
I've always been of the opinion that YAML is inappropriate as a configuration format, period. It has ambiguous parses, which is an immediate no-no.
I remember years ago I was writing my own parser for YAML and when I came across this problem I posted on a few mailing lists and got confirmation that the behavior is per-spec. I've never touched YAML since and still won't.
But I miss the days of XML, XSD, and dare I say it ... XSLT. XSLT is stupidly good at what it does as long as you don't abuse it.
The problem with using a general-purpose programming language for configuration is that you lose the ability to statically interpret it. Maybe one solution is to make sure the configuration SDK is fully side effect free, so that it's always easy to run the configuration with fixed inputs and get a deterministic output.
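A tiny sketch of that discipline (all names invented): the config is a pure function of its inputs, with no I/O or environment reads, so rendering it with fixed inputs always yields the same serializable output that can be diffed like a static file.

```python
import json

def make_config(env: str, replicas: int) -> dict:
    # No I/O, no clock, no environment lookups: same inputs, same
    # output, so the result can be rendered and reviewed statically.
    return {
        "env": env,
        "replicas": replicas,
        "debug": env != "prod",
    }

rendered = json.dumps(make_config("prod", 3), sort_keys=True)
```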
I will never understand your problem with it. JSON or YAML should be enough for 90% use cases. And INI or TOML the rest. Let the XML relic rest in peace please.
Bah. While YAML is far from perfect, it's fine for random config files written by humans. TOML is mostly better, but from its own homepage (https://toml.io/en/):
[servers]
[servers.alpha]
ip = "10.0.0.1"
role = "frontend"
[servers.beta]
ip = "10.0.0.2"
role = "backend"
is ugly to my eyes. And I'd rather swallow my own tongue than have to hand-edit XML in the most common case where there's not a dedicated editor for that specific doctype.
In fact, I'd echo the linked article's argument back: I don't know of a case where XML is the best option. For human-edited files, pick almost literally anything else. For serialization, JSON handles the common cases and protobufs-and-friends are better when JSON isn't enough. There's not a situation I can imagine where I'd use XML for a greenfield project today.
{
"servers": {
# Frontend server is called alpha
"alpha": {"ip": "10.0.0.1", "role": "frontend"},
# Backend server is called beta
"beta": {"ip": "10.0.0.2", "role": "backend"},
}
}
Or, depending on your preference:
servers:
# Frontend server is called alpha
alpha:
ip: "10.0.0.1"
role: "frontend"
# Backend server is called beta
beta:
ip: "10.0.0.2"
role: "backend"
All of them suck in their own way. All of them work fine with autocomplete, type analysis, and autoformatting. The more things change, the more they stay the same.
JSON lacks comments, that's the biggest differentiator in my opinion.
Your YAML example doesn't need double-quotes around the IPv4 addresses, but then very confusingly and problematically does need double-quotes around an IPv6 address, due to the colons.
This creates a serious footgun in Ubuntu netplan, leaving a server totally unbootable, but simultaneously not triggering "netplan try" as any sort of parsing problem:
And JavaScript doesn't need semicolons in all but three exceptional cases, but I still consider it good form to write them. Likewise, I consider it good form to explicitly quote values whenever possible.
Getting in the habit of not doing so will lead to schema violations, like the Netplan problem you linked, which can crash the program trying to read your config. If it bails out at an unfortunate time, like most networking tools seem to do, you'll need to use a recovery boot image or serial console to fix your config.
> This creates a serious footgun in Ubuntu netplan, leaving a server totally unbootable, but simultaneously not triggering "netplan try" as any sort of parsing problem:
Been there, done that. A good config format or linter should’ve complained and not let me commit such a mistake.
JSON is by far the easiest to reliably parse. It doesn't rely on tabs or spaces which YAML suffers from. XML is just more verbose JSON without an array object, and has some redundancy in spec which is not a good design.
Lack of comments for JSON isn't a huge issue considering you can make the keys fairly verbose. And it would actually be pretty easy to add comments to the spec, and parsers would still be backwards compatible.
It's my preferred configuration file format, it fixes all the problems I have with JSON (trailing commas, comments) without turning it into a mess full of gotchas like YAML.
A theory: they don't want third-party code in a hot path (they do care about performance in VS Code), they already have a very performant parser, and they don't want to add complexity there.
All of them are easy to parse. I personally prefer the clarity of indented YAML over the endless nesting of {} JSON brings, but all formats are easy enough to read or write.
Lack of comments in JSON is a huge problem for config files. It's not an issue if you're just exchanging data between APIs, but for config files, comments are essential.
There are some JSON variants that allow comments, but tools rarely specify which dialect their parser accepts. There are also workarounds that abuse the fact duplicate key handling isn't part of the spec by specifying each key twice, once with a comment and once with data, as most parsers only make the second key stick; those are even worse.
You can't add backwards compatible comments to JSON, there's no space in the JSON spec to retroactively insert comments somewhere. The closest you can do is the duplicate key trick, but as the spec doesn't state which of the keys to read as a value, that trick only works with specific parser implementations.
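A sketch of why the duplicate-key trick is parser-dependent, using Python's stdlib json module (which happens to keep the last occurrence):

```python
import json

# RFC 8259 leaves duplicate-key handling unspecified. CPython's json module
# happens to keep the last occurrence, which is exactly what the
# pseudo-comment trick relies on. Other parsers may keep the first key,
# or reject the document outright.
doc = '{"port": "comment: the admin API listen port", "port": 8080}'
config = json.loads(doc)
print(config["port"])  # 8080 under CPython -- not guaranteed elsewhere
```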
YAML has newlines that indicate the end of a field... unless otherwise specified, which then gets into issues with line endings. Indents can also cause issues, considering a space inserted somewhere by accident can mess up your whole document. JSON and XML rely on specific tags for elements, which are much more reliable, and thus easier and faster to parse.
You can easily add comments to the JSON spec by just requiring every parser going forward to handle them. Such parsers would still read old, uncommented JSON files just fine.
- introduce the least idiomatic form of YAML as "depending on preference"
- add an optional preamble to the XML example
- add comments when there were none, but also not to all of them
If you didn't artificially stretch the different examples to match, there'd be a much clearer difference between them all, especially considering the fact one of your three examples is a superset of the other.
The odd one out can't even capture an integer vs a string without a schema.
I was puzzled by the GP's choice to write the first YAML example in a completely unidiomatic way.
But I think the point of adding comments was just to show that comments are possible in some formats, and not others. Omitting possible comments from the XML example might have just been a sign of fatigue over this topic ;)
At any rate, I find the (idiomatic) YAML example to be -- by far -- the most readable of all, including the GGP's TOML example.
- To showcase to the many people who say JSON is more legible that you can use YAML as "legible" JSON
- Most XML files I encounter come in this format. You can skip the preamble but it wouldn't match my real life experience.
- All readable config files I encounter have comments. I forgot to add comments to the XML representation, but I can't edit my comment anymore. I think everyone who ever encountered XML knows how to add comments, though. JSON simply doesn't support comments unless you use a niche JSON derivative.
As for the string versus integer problem: you always need a schema, or you'll run into very funny problems down the line. None of these formats intrinsically know what keys refer to an object and what keys refer to a string, that's all based on your schema anyway.
"10" is a string. 10 is a number. [10] is a single number in an array. {"number":10} is an object.
You're conflating advice for databases with advice for data serialization formats: XML captures less information about the data it contains intrinsically.
Also please don't use YAML as "JSON with comments", you're just asking to run into some obscure bug/corner case
If you're willing to do weird things there's always JSON5
I've gotten bit by trailing commas enough times (both manual edits and writing generators) that I absolutely expect any reasonable syntax to tolerate them. It's just so much easier and more consistent to tolerate them.
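A quick stdlib illustration of the strictness (Python's json module, which follows the strict grammar here):

```python
import json

# Strict JSON rejects trailing commas -- a classic trap for both
# hand-edits and naive generators that emit "item," in a loop.
try:
    json.loads('["alpha", "beta",]')
    tolerated = True
except json.JSONDecodeError:
    tolerated = False
print("trailing comma tolerated?", tolerated)  # False with the stdlib parser
```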
That's what Clojure got right. Comma is whitespace. When you print a data structure, it has commas, but they don't affect reading of that structure. It's brilliant and practical.
It does. However, it’s much more common to edit the end of a list, in my experience. Still, a syntax that is entirely uniform (like trailing commas) is preferable, in my opinion.
JSON5 does, but most software just does plain and simple JSON. I haven't seen it used outside some Javascript webdev environments. The JSON5 docs also seem to be specifically targeting Javascript development.
If you're sticking to certain variants, you may as well use YAML, which supports JSON notation, as well as comments and various other improvements.
If you use python, there is an excellent json5 module. But true, it may not be as well supported by other languages.
I am not sure I may as well be using yaml. I don't like it for the multiple reasons in the OP and this thread.
If you are using python, I have found it to be quite easy to support both json5 and yaml, as well as converting between them for people who feel strongly about yaml. Not trivial but low effort.
None of them handles it well. YAML has 7 different modes for it, so you will inevitably mix them up and use the wrong one; otherwise it's actually the only option that supports it. JSON requires inlining \n. XML only does it with whitespace indentation.
For human-maintained config, TOML is only "better" when the structure is so flat that it's almost indistinguishable from an INI file.
Anything more complex and it becomes one of the worst choices due to the confusing/unintuitive structure (especially nesting), on top of having less/worse library support.
YAML's structure is straightforward and readable by default, even for fairly complex files, and the major caveats are things like anchors or yes/no being booleans rather than the whitespace structure. I'd also argue some of the hate for YAML stems from things like helm that use the worst possible form of templating (raw string replacement).
I'm with you on all that. I think YAML's fine, and I like it way more than TOML for non-trivial files.
I think Python's pyproject.toml is a great use of TOML. The format is simple with very little nesting. It's often hand-edited, and the simple syntax lends itself nicely to that. Cargo.toml's in that same category for me. However, that's about as complex of a file as I'd want to use TOML for. Darned if I'd want to configure Ansible with it.
Agreed. I do a lot of Ansible, and it took me a while up front, but I've become pretty accustomed to YAML, though I still struggle with completely grokking some of the syntax. But I recently took a more serious look at TOML and felt it'd be a bear for Ansible.
A few months ago I made a "mini ansible / cookie cutter" ( https://github.com/linsomniac/uplaybook ), and it uses YAML syntax. I made a few modifications to Ansible syntax, largely around conditionals and loops. For YAML, I guess I like the syntax, but I've been feeling like there's got to be a better way.
I kind of want a shell syntax, but with the ansible command semantics (declarative, --check / --diff, notify) and the templating and encryption of arguments / files.
> For human-maintained config, TOML is only "better" when the structure is so flat that it's almost indistinguishable from an INI file.
Agree. I've recently inherited a python project, and I'm already getting tired of [mentally.parsing.ridiculously.long.character.section.headers] in pyproject.toml.
Seriously, structure is good. I shouldn't have to build the damn tree structure in my head when all we really needed was a strict mode for YAML.
> I'd also argue some of the hate for YAML stems from things like helm that use the worst possible form of templating (raw string replacement).
I was literally speechless when I saw helm templates doing stuff like "{{ toYaml .Values.api.resources | indent 12 }}", where the author has to hardcode the indentation level for each generated bit of text like a fucking caveman.
The tiny examples might look kinda okay, but when someone has stacked 10 different patch operations in a single file, it gets a lot harder to keep track of what's going on.
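For contrast with the string-indentation approach above, building the document as data and letting a serializer emit it sidesteps the indent arithmetic entirely (a minimal Python sketch; the names and values are invented):

```python
import json

# Structural generation: assemble the config as plain data structures and
# let the serializer worry about layout -- no hand-counted indent levels
# as with string-substitution templating.
resources = {"limits": {"cpu": "500m", "memory": "256Mi"}}
manifest = {"spec": {"containers": [{"name": "api", "resources": resources}]}}
print(json.dumps(manifest, indent=2))
```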
“Nesting is bad” is such a simplistic take. Nesting is absolutely essential and inescapable. What that statement is really doing is placing a limit on what whatever it applies to can be used for. It would be better to spend a few more words expressing what you really mean.
Your comment is a simplistic take on "Nesting is bad" given the context.
It's not hard to infer that they're referring to nesting as a footgun: make it harder and you lose some power but you keep your feet.
Config files are a poor place for complex and deeply nested relationships. If it's not ergonomic to reach for nesting, people tend to be forced to rethink their approach.
The problem is "config" means different things to different people. Some people see config as "the collection of runtime parameters" basically a bank of switches: Pyproject.toml is config. Others see any form of declarative structured data ingested by a runtime as config: docker-compose.yml is config.
And of course to minimize impedance mismatch, the structure should be similar to the domain.
So yes I want a "config file" to handle at least a dozen levels of nesting without getting obnoxious.
Then I guess to frame it in your language: they want formats that encourage config files, not "config files".
And I don't disagree. The problems of nesting objects "at least 12 levels deep" aren't going to be solved by the right format. The tooling itself needs to expose ways to capture logical dependencies other than arbitrary deep K-V pairs.
What if your problem is best expressed as "arbitrary deep K-V pairs"? It's going to be more common than not, nesting really is that fundamental.
There is no escape, you can't win. If you want the nesting, and assuming you can't remove it from the problem itself (as you often can't, or at least shouldn't), there's only one thing you can do: move inner things out, and put pointers in their place. This is what we do when we create constants, variables, and functions in our code: move some of this stuff up the scope, so it can be used (and re-used) through a shorthand. It loses you the ability to see the nesting all at once, but is necessary (among other reasons) when the nesting is too large to fit in your head.
Of course once you do that, once you introduce indirection into your config format, people will cry bloody murder. It's complex and invites (gasp) abstraction and reuse, which are (they believe) too difficult for normies.
The solution is, of course, to ignore the whining. Nesting is a special case of indirection. Both are part of the problem domain, both are part of reality. Normies can handle this just fine, if you don't scare them first. You need nesting and you need means of indirection; might as well make them readable, too. Conditionals and loops, those we can argue about, because together they give a language Turing-complete powers, and give security people seizures. And we have to be nice to our security people.
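YAML's own anchors, aliases, and merge keys are one such in-format indirection mechanism (sketch; the `<<:` merge key is a YAML 1.1 convention that most, but not all, parsers support):

```yaml
defaults: &defaults
  retries: 3
  timeout: 30
servers:
  alpha:
    <<: *defaults
    ip: 10.0.0.1
  beta:
    <<: *defaults
    timeout: 60      # override one inner value without re-nesting the rest
```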
This is whining that people won't endorse a lazy, poorly scaling approach to an engineering problem... and justifying that approach by conjuring hypothetical whiners against a common, better scaling solution.
If you need 12 levels of nesting, add indirection, or live with the fact no one is designing formats to enable your oddball mess of a use case.
12 levels of nested braces in a single function is already a crappy idea: it's an even more crappy idea in a config file because of the generally inferior tooling, and now there's a downstream component that needs to change to support a cleanup (meaning it almost never gets fixed and the format just gets worse over time)
> For human-edited files, pick almost literally anything else.
I'd still take XML over JSON for human-edited files. At least XML supports comments.
> For serialization, JSON handles the common cases
Counter-point, JSON sucks and is way overused. The types are too fuzzy, the syntax too quirky, and validators/schemas are almost never present. You can bolt that all on, but it wasn't designed for it and it shows. It was designed to be eval()'d, which you should also never do because it's a terrible idea. It's flawed at the foundations.
Those are valid points, and while I have a different opinion, I can't say you're wrong about any of it.
But I will say that the first time I used a JSON API that had replaced an XML one, I almost wept with relief. Perhaps because JSON is so simple, it pushed APIs toward having simpler (IMO) semantics that were far easier to reason about. Concretely, I'll take an actual REST API (that is, not just JSON-over-HTTP) over the SOAP debacle any day of the week. I know you can serve XML without using SOAP, but to me they're both emblematic of the same mindset.
> Counter-point, JSON sucks and is way overused. The types are too fuzzy, the syntax too quirky, and validators/schemas are almost never present. You can bolt that all on, but it wasn't designed for it and it shows.
XML's validation, schemas and typing are far more complicated and equally useless - the impedance mismatch is too big, all they do is give you a whole bunch of extra ways to shoot yourself in the foot, particularly in the presence of namespaces. If you want something fully structured, protobuf or equivalent is the way to go (and converting back and forth between protobuf and JSON is relatively painless).
XML Namespaces are one of the worst anti-features I have ever encountered. I have yet to see a legitimate use for them, but they make parsing way more of a pain than it needs to be.
Also the use of attributes vs nested tags seems pretty arbitrary and in my experience attributes are hardly used at all.
They have a parse.y that sort of gets traded around and joins new projects. Nothing super formal, but it does mean most OpenBSD service configuration feels alike, while each config is still tailored to its application. And because it's properly parsed, the error messages can be better.
In fact that is my biggest beef about YAML. I mainly use it in the context of Ansible, and the parser usually has no clue where in the file the error actually is. You have to depend on remembering where you last edited to actually find the error. My other big problem with YAML is that the Ansible context is trying very hard to make it a programming language... And while it is an OK-ish config language, it is a terrible programming language.
In fact this is a common problem with many complex environments. They want to push this complicated setup into a config file and claim "look, it is easy, no programming required", when really what they have done is push a programming situation into the world's worst programming language. See also: XSLT.
# Diff for interactive merges.
# %s output file
# %s old file
# %s new file
merge="sdiff --suppress-common-lines --output='%s' '%s' '%s'"
It's useful the first time you dive in, not having to read the man page. But over time, the comments can get out of sync, especially if you don't carefully merge in the package maintainer's version with every update.
Do you mean non-json types? Because the supported types seem pretty straightforward. (besides perhaps supporting null bytes in strings in things like postgresql)
> the syntax too quirky
Care to explain? This has always seemed like one of jsons strengths. The syntax for what is valid is pretty straightforward.
> Do you mean non-json types? Because the supported types seem pretty straightforward. (besides perhaps supporting null bytes in strings in things like postgresql)
What is a "number"? Is it a float? int? short? double? BigDecimal?
What about time values? Or dates? Oh, you have to just shove those into strings and hope both sides agree? That's fun.
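A small Python sketch of both complaints (the payload shape is invented):

```python
import json
from datetime import datetime, timezone

# JSON has a single "number" type and no date type at all, so both sides
# must agree on conventions out of band.
stamp = datetime(2024, 1, 2, tzinfo=timezone.utc).isoformat()
back = json.loads(json.dumps({"when": stamp}))
print(type(back["when"]))       # just a string again; the date-ness is gone
print(type(json.loads("1")))    # <class 'int'> in Python...
print(type(json.loads("1.0")))  # <class 'float'> ...but JS sees one Number type
```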
> Care to explain? This has always seemed like one of jsons strengths. The syntax for what is valid is pretty straightforward.
One example is the json "spec" on json.org does not allow trailing commas yet many parsers do
Yeah, I personally find XML elegant and well-designed compared to the popular alternatives today. And things that supply what’s missing (for example, JSON-LD and JSON Schema) aren’t much less complicated than the XML equivalents.
Of alternatives, I think EDN is the closest to being a satisfactory replacement because it supports namespaces.
Tangentially: there are some objects in the Kubernetes configuration that require the data to be base64 encoded (I think it's secrets and config maps, probably something else). When I was preparing for my CKA certification, I used the (most?) popular course that introduced base64 encoding as a _security measure_. I think that also says something about the state of the industry.
No no no, k8s's mistake was actually not using YAML hard enough. They built an object system on top of a format that can act as a typed serialization format for generic objects, and then decided to just ignore all that and implement it on top of primitive types.
in terms of something i'm shipping to users, i would much rather have to worry about the nuances of TOML than of YAML. with TOML, i don't have to worry about e.g. remote code execution because someone figured out a clever way to trick my yaml into running arbitrary code somehow. that kind of shit is annoying.
is it uglier? sure. but it's peace of mind...
in terms of config i'm using myself, say Kubernetes stuff, i really love YAML...because i know exactly what it's doing and i generally keep things simple. it's nice for that, it just does way too much IMHO...
I love it. Was so happy to see that syntax when I first encountered Toml. Hierarchy is always clear without having to scroll, and snippets retain context.
> I'd echo the linked article's argument back: I don't know of a case where XML is the best option.
From the article:
> “I’m making a new kind of book, and I need to annotate all of the verses in the Bible, and have the chapter headings and stuff.”
XML is a markup language, and works great for marking up text. It came out of SGML and attempts to make machine-usable documentation.
I'd favour its use in something like a datasheet, which is a combination of human-readable information, nested objects, and lots of stuff that needs to be machine parseable in fairly precise ways.
IMO it also works "fine" for other structured document formats that aren't text, like SVG, but that's not a strong opinion. JSON and other formats compete more sensibly here, but I'd never want a protobuf-based format for... writing an essay, for example.
XML is pretty good for information extraction with LLMs. It will parse your input text into a tree structure. JSON and YAML will conjure slightly different skills from the LLM. Maybe it all comes from the slightly different applications of these formats in the training corpus.
A flat TOML file is indistinguishable from INI except on one point: you can create a pure key-value file without sections, which most INI parsers I've seen won't accept.
And yet it's the only one out of all those mentioned that can be written with ease by humans. After a short while you can do several levels of nesting without thinking about it. For all its issues, it's by far the most practical which is why it won.
Practical but incorrectly used. We should correct the misunderstanding so that people don't continue to make similar mistakes, inconveniencing themselves and others.
i agree, while YAML has its problems, it is good for human-written content, especially when it's not too deeply nested.
in short: all these have their use-cases. i use YAML where it fits and TOML where it fits better. I never use JSON because JSON is just for machines, I generate JSONs.
I, for one, am displeased with .yaml. I recently had a major footgun incident where a Debian VPS was rendered completely unbootable because of Ubuntu netplan's highly-annoying use of YAML (where I had done the slightest misconfiguration, and to my eye it looked perfect, and my changes successfully passed the parser of "netplan try"). Yes, that's right - the server needed to be rebooted in rescue mode; it wasn't just merely stranded with no working network interfaces, where the web-based serial console would have been enough to undo the footgun gunshot. Nightmare!
The solution was to uninstall netplan.io Debian package where it didn't belong - get that YAML out of there. My hosting provider, OVH, figured it would be a good idea to shoehorn netplan - with its accursed YAML - into Debian for Network configuration. Bad move.
+1 for TOML. I love its use in innernet config files to set up new clients with a single generated "invitation" file.
I've broken plenty of networks by adding typos to /etc/network/interfaces and then running ifdown;ifup. Misconfiguration can happen no matter what format your config files are in.
Yes, without the quotes, the IPv6 address gets interpreted as a YAML mapping/dict because of the colon(s).
Perhaps the trap is the complacency that YAML induces by not requiring quotes around keys/values, and so text risks being interpreted in unexpected ways. The infamous Norway Problem has the same root cause.
Even in the first position? That makes it no longer be the claimed "superset of JSON", since (AFAIK) {"key":"value"} (with no whitespace) is valid JSON.
For delimited collections, so within {} and [], if the key is quoted, then you don't need the whitespace. So your example parses as expected as does `{"key":value}`, but `{key:value}` turns into `{"key:value": null}`.
That's odd. Netplan looks like it uses libyaml, and vanilla libyaml definitely parses that naked v6 addr as a plain scalar. Maybe Netplan adds an extra schema on top or something? If you know what data the bare v6 string turned into, I'd love to hear it.
YAML is like JSON, in that the format requires you to think about strings vs integers.
I bet the parser did actually fail, because you can't parse a malformed dictionary into a string. Your problem is that the tool probably took down the interface before trying to parse the new config, and then failed to bring the interface back up.
Similar to bogus data in /etc/network/interfaces, resetting the network interfaces with bogus data will end up with your server having no or limited connectivity capabilities.
At least with YAML there are command line parsers available to check your work. Plaintext config files often end up being a game of chance to see if you've got the format right.
Unpopular opinion I guess, but I really like netplan. I like having the vlan, bonding, ip, routing, dns configs all in one place, and it's worked well for me over 4-5 years.
In my rather short experience with netplan, I found:
As above, you can’t ask netplan to sanity check a config.
You can’t create a draft configuration, apply it, and save it if it works well. Every self-respecting network config system since at least Cisco IOS can do this (and does it by default!).
Interface renaming can’t filter by being a physical interface, which means that the system tries, and fails, to rename VLANs, because their MAC matches something that should be renamed. (networkd can handle this, but the networkd config written by netplan is wrong.)
Deleting virtual interfaces (e.g. VLANs) seems to be essentially unsupported, at least on 20.04. I think it’s slightly, but only slightly, better in newer releases.
Ouch, quite a lot of 'learning'! Color me unimpressed, too.
I've grown to enjoy NetworkManager. I know, like all things, that is probably controversial to some.
Two things I really appreciate about it:
- You can 'up' a connection/interface in an idempotent way; only changing whatever is needed.
- it's *very* scriptable. Values can be given with +/- operators
I was surprised/frustrated with networkd initially, but enjoy it now.
It getting involved with packet forwarding was an unwanted surprise during a modernization effort.
And then OVH makes you hand-configure your IPv6 address, with netmask and IPv6 gateway - no DHCP6 for their VPS servers. Then they put the footgun in your hand by not mentioning the double-quotes requirements which YAML has for IPv6-with-netmask.
When you have a linux-only scenario - say on your laptop, and servers, which is my case - innernet's simplicity and fairly-good elegance is tough to beat.
As soon as Windows/MacOS/Android/iOS clients want in on the fun, alas, you'll need something more complicated than innernet to accommodate these other clients.
XML's curse was that it started with all of the complexity that everyone will eventually add to any possible replacement. Worse still, people felt compelled to use every feature that they could for esoteric reasons. Even worse, the big companies did not actually work well with each other's document definitions. Even ones that were supposed to work well together.
That said, you only need to look at the abomination that is an OpenAPI spec that has been annotated to work with AWS to see that the pitfalls are still mostly the same today. (I can also complain about the breaking changes from the old specs.)
XML is full of complexity that it shouldn't have and any competitor would be stupid to adopt.
For example, either your format is a programing language, or the data evaluation must never need an internet connection. XML isn't one, yet still requires it.
If you want to implement the standard, your parser must be prepared to download extra data when the file requires it (it doesn't actually need to download the schema, even though that's the naive way to implement it).
A lot of parsers do not implement the standard, and they are better for it, because this can create huge security issues. But it's something you have to be always aware of, and could change on any minor version update.
Right, if you are willing to validate any and all random documents, you will have to have some sort of way to get the schemas. I can think of very few reasons to validate unknown schemas in an application, though. Would be like allowing your parser to take in external entities. Can be useful and there are valid reasons for trusted sources. But not for random documents from the web.
And much better at it, if only due to not shooting yourself in the foot by default with namespaces.
The problem with XML isn't that it supports schemas and namespaces. The problem is that it forces everyone to pay the costs of those things up-front, even if they don't use or care about them.
And because nobody wants to pay the namespace cost upfront, we all end up with vin, vin_no, viNumber, vinNum etc instead of iso3779:vin, for example. Every time we work with more than one API, it's prudent to try aligning the terms used by those APIs.
In my experience you have to have the duplication before you factor it out; trying to design those terms up front when you have no use cases, or even one use case, is doomed to failure. So you're always going to need to start with something non-standard, figure out the standard based on that experience, and then migrate to align to the standard.
Right, this is pretty much exactly what I meant by XML started with all the complexity that other things will add.
I will happily cede that data is a huge area where "duck typing" is far and away the correct choice. Taxonomies of data fail all the time. With odd rules that are largely defined by their exceptions.
XML is overwrought, YAML is a foot-bazooka, JSON lacks comments, TOML gets unwieldy with more than a few levels of nesting, EDN is neat but kind of obscure...
Maybe we need a new configuration format...
ducks
Actually I think something in the space of JSON5, Jsonnet, GCL, HCL, CUE, etc, will be the one to win out in the long run. JSON-but-fix-most-of-the-warts.
Then there's things which almost sit between fully declarative and turing-complete. Maybe Dhall, if it picks up a few more language implementations. Or CEL.
+1 for EDN. In terms of new formats, I'm quite fond of RON, it hits a similar sweetspot in flexibility. JSON5 is nice too, mostly because it's popular.
IMO Jsonnet, CUE, etc are just in a different category of complexity since they include code and require a full-blown interpreter to read.
Using Terraform/HCL as a pre-processor to generate any of the other formats seems like the best option now. It offers:
- No footguns (compared to the alternatives)
- Basic text-templating
- Structural manipulation of data, as opposed to many other config generators that only do string interpolation of text. Helm and j2... kill me please, whoever thought this was a good idea
- Basic looping, transformations, and function calls, without falling into the full imperative programming language trap. Configuration is still deterministic and directed.
- Support for loading data from other sources
- Support for comments
No it's YAML's fault - the fact that type coercion was once part of the serialization spec and not part of an adjacent schema spec is a tragic mistake. Now there's a bunch of outdated parsers that exhibit the issue and some that have it fixed. Meaning even more inconsistency and frustration. In an ideal world, YAML just shouldn't exist.
Even having any type coercion is a huge mistake. A true/false shouldn’t accept 0 or “false” with quotes or False with a capital f. It should be true or false. A single line string should accept one kind of quotes and one way of escaping.
Similarly, in XML it’s a curse that
<Foo><Bar>1</Bar></Foo>
and
<Foo Bar="1" />
are semantically the same but considered different in most systems.
Just avoid ambiguity and make it impossible to face a choice. The same semantic meaning should have one expression.
Oof. That's YAML 1.1. Need to use a 1.2 loader, which should have much better defaults around types. No more Norway problem unless explicitly opted into. We should badger^Wencourage our libraries to stop relying on a spec that's 14 years and multiple versions out of date.
The Norway problem is way overstated; people didn't learn YAML's reserved words and got mad when it did what it said on the tin. If you know the type, specify the type, and YAML will happily err when you put a square peg in a round hole. Better yet, actually validate the types when you ingest a document! Pfft, who would bother with that? Recent versions of Ansible will now force you to do it right and not accept numbers where it expects a string. I believe the Prometheus ecosystem does it as well.
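A sketch of what "specify the type" looks like: YAML's explicit tags pin the type no matter what the 1.1 implicit-resolution rules say.

```yaml
country: !!str no     # stays the string "no", not false
enabled: !!bool yes   # explicitly a boolean
version: !!str 1.20   # survives as "1.20", not the float 1.2
```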
When it comes to config formats this is pretty reasonable. People are just primed to expect true/false to be reserved.
---
enable_frobulator: yes
use_thopojog: no
This was also the issue presented in the OP, where they were mad that their version, which is a string, was interpreted as a number because it wasn't in quotes. Like, I don't know how you expected YAML to fix that for you. If you don't quote a number in TOML it'll also be wrong.
>>“I need to configure this server and the server needs to know if this value is true or false.”
>No, that’s bad. Don’t do that. That’s not a good use for XML.
First, the primary case that people use YAML for is not appropriate for XML? I don't agree with this, but taking it at face value, it creates a YAML strawman to attack.
Second, the rest of the argument boils down to "I don't like how floats are parsed in my language of choice". Guess what? You're storing your version strings in the wrong data type. Stop storing semver and its variants in floating point types; that's not what they're for.
Finally, this is ultimately a "considered harmful" article, which is always a red flag.
This is one of the few HN discussions where I feel a little bit qualified to give an opinion :)
Two years ago I started a small data quality checker where users could define their alerts, frequencies, etc., all in config files instead of modifying code.
I initially chose JSON as config format, but then realised comments are necessary to guide users in defining alerts. I moved to YAML, but after some "indentation incidents" started using HOCON conf [0] and never looked back. I don't see any reason for choosing YAML over one of JSON or HOCON, except being forced to because of some dependency. Features such as inheritance and text block support which were essential for me are nicely supported in HOCON.
hey! would be cool to chat about what you've built. we are currently building Keep (https://github.com/keephq/keep) where you can define alerts as YAML. would be cool to learn from you.
XML is well structured, typed and general purpose, but it's a pain for humans to edit. I think the editing pain is one of the main reasons people look for XML alternatives. Personally I wish XML was written a lot more like HAML.
XML is a data format for the correct exchange of information between computer systems. Human-friendly editing was never the number-one goal for its designers. And with good tools like IntelliJ IDEA, I have an easier time editing pom.xml configurations than k8s YAML files, to be honest.
Nah, I have an easier time editing XML in a dumb text editor than YAML with the best tooling I've ever found for it.
In fact, I have an easier time with any of its competitors people post around than with YAML. Which is distressing, because the language clearly has the goal of being easy to write.
Structured editors have been around since the beginning of XML but they have been slow to catch on generally and have never been that popular because people don’t feel in control.
Ah, kinda. Yes, they existed, but they frequently didn't have many of the more standard features like autocomplete (particularly on both sides of a new tag) and on-the-fly structure parsing that makes much of the painful parts easier to work with. I think we take for granted how good our modern approaches to IDE assistance can be.
Ever use the earliest IDE incarnations of Eclipse on a PC still measured in hundreds of Mhz, or when RAM was still in the dozens of MB? That was the world XML found itself in during its inception.
Electric tags in emacs make it mostly not a problem, and most editors have some equivalent: ‘it’ and ‘at’ motions in vim as well as vim-surround and emmet.
But, I agree that this is one of the worst parts of XML
If a majority of programming nerds weren't so averse to building UI to facilitate more complex actions with better guardrails, we'd probably be better off in many ways. Our desire to not standardize anything beyond "human readable ASCII-like" has held us back, imo.
Xcode's plist editing ability is a mild improvement over manipulating XML text directly, but could use more obvious shortcuts/hotkeys. Even writing XML in an IDE like IntelliJ isn't great, autocomplete should do a lot more.
Why would you need text-like XML if you get a good tool that converts to a more useful data entry format? Wouldn't you just use more efficient binary encoding for storage/interchange, while converting to something user-UI-friendly?
At least I've never had production errors because of malformed xml. Yaml, though.. A misplaced dash and your list of a single item is suddenly two distinct items instead.
While I definitely agree yaml is pointlessly prone to this ("NO" lol), I've had plenty of xml issues. Bad manual attribute encoding, duplicate attributes (pretty easy when there are lots) that vary in behavior depending on your xml reader, using text content instead of attributes, using lists of text nodes as a map that then gets deduplicated inconsistently instead of attributes (to work around needing to manually encode complex or multi-line text into an attribute)...
Humans can screw up anything. And more text often allows it to hide for longer.
Sure, if you define and use it. Same as yaml schemas (they exist!).
XML has them sorta built in (basically all (notable) libraries support them), but it's not like it's required or somehow innately protected because of that. It's just a bit easier to adopt.
But my case was with a yaml validated to a schema. A kube ingress file with a list of rules. My additional dash made it a new entry, in practice allowing everything. With xml it would have been very explicit that I now accidentally had made two rules.
I believe XML has it baked directly into the official spec. JSONSchema is great, but I wouldn't call it a standard yet, and anecdotally I haven't seen it as often as I'd like in use (i.e. I either need to run kubectl --dry-run or use a separate third party solution to validate my changed yaml).
Pro tip: remember that all JSON is valid YAML. You can put JSON in a .yaml file and be just fine. I find that handy when the explicitness makes it easier to read the file.
I once had an issue where something failed in prod but not in test, it was because a MAC address was dynamic and in prod only consisted of numbers so whatever tool we used parsed the Yaml value as a sexagesimal number and threw a type error. Yaml can be interesting…
Note that the behavior you describe (like the “NO” problem) was part of the YAML 1.1 excessive-effort-at-DWIM insanity (in this case, intended to make time entry “just work”) that was removed in YAML 1.2.
I wrote more about it on my blog: https://blog.kronis.dev/articles/ever-wanted-to-read-thousan... but the gist of it is that I had to parse thousands of blog feeds and some article from 2009 included a SOH control sequence inside of otherwise valid XML and this broke everything, until I added additional error handling.
Malformed xml sneaks through the kitchen window and poisons you in your sleep. Someone puts naively escaped umlauts (ä -> &auml;) into the surname field and your entire xml parser borks with unknown entity errors.
I have. A ton of vendors and APIs generate XML with interpolation or some other form of hand-rolled code that’s incorrectly escaped, which compliant parsers can’t parse and they can’t fix.
I would rather say that XML is untyped (an XML document only contains text of various flavours), and that is why the only concrete example given works in XML but would fail in YAML or JSON:
“test this against Go 1.20”
> It interprets that as Go 1.2.
Well, understand your data format before complaining. YAML, like JSON, understands data types. If you were to write {"value": 1.20} in JSON, it would similarly be interpreted as the numeric value 6/5. The only reason this works "magically" in XML is that XML itself doesn't have data types, it only has text, and the interpretation is left to the user rather than done by the parser.
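For the record, Python's json module behaves exactly this way (the {"value": ...} document here is just an illustrative fragment):

```python
import json

# JSON has a numeric type; trailing zeros carry no meaning.
parsed = json.loads('{"value": 1.20}')
print(parsed["value"])     # 1.2
print(json.dumps(parsed))  # {"value": 1.2}

# A version kept as a string survives the round trip untouched.
print(json.loads('{"value": "1.20"}')["value"])  # 1.20
```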
I don't understand why you are being downvoted. You're absolutely right, though I'd nitpick: not "untyped", but that the only types XML gives you are DOM nodes and strings.
Contrast that with JSON, which provides booleans, numbers, and the list and object composite types. (Side rant: as a standard, JSON does not define whether its numbers are integers, floats, or decimals! A conforming implementation can use whatever type it wants.)
Yes, XML has (multiple) schema definition languages that can be used to enforce that the strings can be coerced into specific types, but XML itself conveys no type information in-band about the values. I think this is one of the reasons it is difficult read and write by hand.
Nice, I wasn't aware of that. But it doesn't change the argument much: the XML document still only contains text data, and it's the schema validation phase that's responsible for converting the data into the correct format. Validating an XML document is an optional step, and I'm not aware of many tools that use XML as their config that perform full schema validation.
To add, the XML specification on decimal data types [0] explicitly says: Precision is not reflected in this value space; the number 2.0 is not distinct from the number 2.00 -- so a decimal data type in an XML document would have the exact same problem as the YAML example in TFA; the only difference is that with XML, the authors of the tool would have to actively shoot themselves in the foot by annotating that element as a decimal type rather than text.
The problem isn't that XML's decimal type doesn't distinguish between 1.2 and 1.20 - it's that versions aren't decimals in the first place.
The fact that versions often contain numbers separated by decimal points, that versions often have only two components, or that minor versions rarely exceed 9 for a particular product are merely coincidences.
My full sentence is more like saying Java is untyped if it were possible to run Java source files from the AST while skipping the type validation step, which seems pretty much a truism to me.
This is fine for me, but I cannot force others to do the same. I use emacs, have good eyes, and the ability to parse XML; most of the people I work with use less-than-ideal IDEs.
looks like json without commas. i like that. but i don't like that the outer keywords don't have a colon while the inner ones do. it feels inconsistent.
My motivation is expressing docker-compose.yaml in the "best" way I can imagine, as a design exercise. In that case, I'd rather have a bunch of (service ...) forms, instead of a "services" object. Not sure why. Maybe it's a pun on interpreter-driven formats like that used by [guix][1].
Configs for any mildly complex piece of software end up looking like:
30% implicit defaults that are waiting to break when you upgrade something
25% boilerplate
20% derivable from some other configuration but you just gotta set them all explicitly
10% cargo culted in through copy-pasting from the last project
5% can only ever be set to one particular value or nothing works
5% silently ignored due to typos - (some of these are causing subtle bugs and some are preventing them)
3% critical to actually doing the thing
2% used to have an effect but that was 3 versions ago
Of these flags and values only 40% have meaningful and correct documentation and 66% are different in staging and production but we are 85% sure that's ok.
Yes XML, YAML, and JSON have a ton of warts, but a good part of the problem is how we layer configurability into software systems in the first place. The serialization format can only be blamed so much.
The primary problem with YAML is that it allows writing strings without quotes. This causes confusion because it’s impossible to see, from just reading the YAML file, which strings become strings and which are interpreted as numbers or booleans.
For example, the identifier blah is interpreted as a string; but 01 is interpreted as a number, which makes it equivalent to 1. Similarly the identifier nope is a string while no is the Boolean literal false, and there’s no clear indicator that the former is a string while the latter is a Boolean.
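A rough sketch of how a YAML 1.1-style resolver guesses at these types (heavily simplified: real implementations also handle hex, sexagesimal, timestamps, and more, and the regexes here only approximate the spec's patterns):

```python
import re

# Illustrative subset of YAML 1.1 implicit-typing rules.
BOOLS = {"y", "yes", "true", "on", "n", "no", "false", "off"}
OCTAL = re.compile(r"[-+]?0[0-7_]+$")
DECIMAL = re.compile(r"[-+]?(0|[1-9][0-9_]*)$")
FLOAT = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?)$")

def resolve(scalar: str) -> str:
    """Guess the type of an unquoted YAML 1.1 scalar."""
    if scalar.lower() in BOOLS:
        return "bool"
    if OCTAL.match(scalar) or DECIMAL.match(scalar):
        return "int"
    if FLOAT.match(scalar):
        return "float"
    return "str"

print(resolve("blah"))  # str
print(resolve("01"))    # int (parsed as octal 1)
print(resolve("no"))    # bool
print(resolve("nope"))  # str
print(resolve("1.20"))  # float -- the trailing zero is lost downstream
```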
I do think XML gets its bad rap from its abusive usage in RPC/services. Using XML in those cases resulted in bloat. But from a configuration-language or markup-language point of view, I can hardly find something that offers schemas for validation, IDE auto-completion (I used to use Eclipse and it filled out the enums and attributes very well), comments, and raw string data (use the CDATA tag) as gracefully as XML does. Go on, prove me wrong.
For something like rpc where data is automatically encoded/decoded then there are definitely better alternatives if you don’t need string representation of the entire blob. Eg: protobuf has a schema and code generation for most languages that can then work with whatever editor you use to write code.
I think editing xml is just as bad as editing anything else and is only made better by having some kind of plugin which understands xml + a specific schema.
It is, as it should be. I'm struggling to format it nicely here on HN, but I believe that the formatting shown here [0] is more readable than JSON or YAML. It gives you type flexibility (instead of JSON's string-only keys). What's the source of your confusion here?
Per that style guide, the above map should be formatted like this:
{:a 1
2 :bar
[1 2 3] :baz}
Most maps written in EDN have keys of a consistent type. A map whose keys are consistently keywords would look like this when formatted:
{:a 1
:bar "https://example.com/"
:baz [1 2 3]}
Or like this, when condensed to one line:
{:a 1, :bar "https://example.com/", :baz [1 2 3]}
The first map had keys of three different types: keyword `:a`, integer `2`, and vector `[1 2 3]`. Why would one want a format that supports maps with mixed-type keys? As a contrived example, mixed-type keys let you define a sparse 2D tile-based map for a game that is indexed by x and y coordinates:
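The example itself didn't survive in this thread, but a hypothetical Python analogue, using coordinate tuples where EDN would use vectors as keys, might look like:

```python
# Sparse 2D tile map: coordinate pairs as keys, tile names as values.
# Only occupied tiles are stored; everything else defaults to "empty".
tiles = {
    (0, 0): "wall",
    (0, 1): "wall",
    (5, 3): "treasure",
}

def tile_at(x: int, y: int) -> str:
    return tiles.get((x, y), "empty")

print(tile_at(0, 0))  # wall
print(tile_at(2, 2))  # empty
```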
The confusion is that the order of the key and value can be reversed, and commas are optional, so there's no way to tell at a glance what e.g. the value associated with :bar is.
This is the case with any hash-map in any language, modulo some of them making certain keys illegal.
But "string":"string" can be reversed in pretty much any language's hash-map.
> there's no way to tell at a glance what e.g. the value associated with :bar is.
You've certainly constructed and formatted a hash-map that is a little tricky to understand, but (1) this combination of key and value types is going to be pretty rare in the wild, and (2) to the extent it exists, people would format it to be easier to parse, either by adding commas or newlines.
I use YAML for config files, and I think that's what most people use it for. So, I thought this was going to be an argument for using XML for config files, but it's just bashing YAML.
YAML is easy to read like TOML or INI, has comments unlike JSON, and has dictionaries unlike TOML or INI. It's not bad.
It's not YAML's fault they didn't quote their numerical strings.
> It's not YAML's fault they didn't quote their numerical strings.
Yes, it is. If you design a format in such a way that type parsing is ambiguous, or in this case "keep trying to parse it over and over, starting with the most restrictive option and working your way down", people are going to commit errors. That's just life.
I would absolutely love it if YAML parsers supported a mode where quoting your strings was required. Or, while I'm wishing: a new YAML version that requires that (even though that's impossible to do backward-compatibly without yucky things like having to specify the YAML version in the document itself).
But I do still use YAML for config files, because there's enough about the other options that I don't like even more than YAML.
I for one appreciated the observation that if you're using XML for something that isn't a document, you're probably doing it wrong.
I recall (perhaps inaccurately) seeing that notion somewhere in http://www.catb.org/~esr/writings/taoup/ and it was at once a shock yet seemed so obvious. Might have been in something else esr wrote, but I've always considered that one to be his masterpiece.
Replying to my own comment. Just read through the TOML spec for the first time in years and its tables are essentially dictionaries. Don't know if that was added later or if I just missed it earlier. I'll consider using TOML for future projects. That said, I still think some of the hate YAML gets is unwarranted.
Why would anyone use it for serialization, when json exists? Yaml is harder to parse and the implementations have a history of security issues because it's so complicated. If you're not writing it by hand why bother at all?
It's a perfectly reasonable serialization format and works fine as such. It does not work well as a configuration format, because humans are not supposed to write it by hand.
I think a lot of YAML hate comes from the systems that use YAML (all those things that configure virtual machines, containers, cloud assets, etc.) that have bad data models to begin with and are part of bad architectures. (e.g. Hashicorp seems to come out with a new product every week to fix the problems with their old products)
I totally agree. I've used YAML for small config files on various projects I've written over the years. Stuff where it's maybe 50 lines on the extreme end, with a very simple data layout that doesn't get more than 2-3 levels deep. And for years and years I never understood WTF people were on about when they said how awful YAML is.
Then I used k8s for the first time. And after that, I understood, because k8s manifests are an abomination. Super error prone and hard to read. And I think that this has a whole lot more to do with the data model than the markup format. YAML isn't perfect (in particular, the way you specify an array of dicts is hot garbage and confusing), but it's not the main problem. The way the data is laid out in k8s would be awful to work with in any format.
While we're on spicy takes, k8s use of Yaml is an abomination and should have been vetoed early on. The data model they're using is too hierarchical and might have been better off in XML or Json
I think helm's use of YAML makes it worse because it uses a text template library to manipulate structured data, and thus requires annoying stuff like {{indent}} and {{nindent}} when you want to insert objects. To me, whenever I find myself using one of these indent functions I'm reminded of the well known pitfalls of using regex to parse HTML.
I don't know what I'd prefer, exactly. Probably JSON, or better yet JSON5.
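A toy illustration of the indentation problem, with plain Python string formatting standing in for the template engine (names and values are made up):

```python
# A multi-line block pasted into an indented context keeps only its own
# indentation, so every line after the first lands at the wrong level.
block = "replicas: 2\nimage: nginx"
template = "spec:\n  {body}"

naive = template.format(body=block)
print(naive)
# spec:
#   replicas: 2
# image: nginx     <- no longer nested under "spec"

# helm's indent/nindent functions exist to re-indent every line of the
# inserted block, roughly like this:
fixed = template.format(body=block.replace("\n", "\n  "))
print(fixed)
# spec:
#   replicas: 2
#   image: nginx
```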
Yeah, I often go that route when embedding objects in YAML. I haven't gone full JSON yet, but it's very tempting. A downside to that is having to convert back to YAML if you need support or want to easily diff against configs other folks have shared.
YAML bashing is sooo tiresome. Yes it has quirks and tries to do too much. And yet JSON and TOML have their own broken idiosyncrasies as well. But in the end they are all trivially interchangeable with each other for 99% of their use cases.
But XML is not even a comparable standard. XML addresses an entirely different problem space. It cannot be directly mapped into baseline common data structures in most other languages like JSON/YAML/TOML.
Saying XML is “better” than YAML requires defining your problem space. Otherwise it just looks like you don’t really understand the difference between the two.
Without the actual YAML ("test this against Go 1.20" would be interpreted by YAML as the string "test this against Go 1.20"), it's hard to know the actual complaint here, but it sounds like there was a YAML file with something like:
testTarget:
- language: Go
- version: 1.20
Where the tool interpreting it expected a version string in "version", but also accepted a number and implicitly converted the number to a version string. This is not a YAML problem, this is a "code that accepts and implicitly converts invalid data" problem. It's true that a schema and validating parser would help with this, and YAML doesn't have a broadly supported standard schema language, but I bet the real problem here was that the underlying tool was written in JavaScript, as it's the main popular language where even the most naive attempt to parse what you expect to be a string value would not fail if the value was actually a number.
> This is not a YAML problem, this is a “code that accepts and implicitly converts invalid data problem”.
No, this is a YAML problem, because YAML is specified to allow for unquoted strings, and then it uses heuristics to decide if you meant a string or a number or (god forbid) a boolean.
So the "code that accepts..." that you're talking about is literally every conforming YAML parser out there. And they do that because the spec tells them to. So yes, it is a YAML problem.
And to top that off, most of the examples and tutorials you'll find on how to use YAML don't quote their strings. I get the idea: fewer characters to type, more human readable. But god it's a minefield.
YAML is far too helpful in converting bare strings to what it thinks is right (like 1.20 to 1.2 or no to false). It could be helped by having data validation, but the fundamental issue is that YAML does not make it clear to either human or machine what is meant or expressed.
The issue is not isolated to JS either; the same could have happened in Python, PHP, or any other untyped language.
> YAML is far too helpful in converting bare strings to what it thinks is right (like 1.20 to 1.2 or no to false)
Versions of the YAML standard from the last 14 years don't do the latter, and supporting numbers as a basic data type is, honestly, a weird thing to harp on as "too far".
> The issue is not isolated to JS either, the same could have happened in python, php, or any other untyped language.
Indexing an associative array of runtimes that is keyed by a string, or doing almost anything else that expects a string, when you get a number instead, will fail with an error in most dynamically typed languages, including Python and PHP; many fewer string-expecting operations (and particularly not object indexing) will fail in JS when given a number.
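A quick sketch of the Python side of that claim, using a made-up runtimes table:

```python
# In Python, a number is not silently usable where a string key is expected.
runtimes = {"1.20": "go 1.20 runtime"}

try:
    runtimes[1.2]  # roughly what a YAML 1.1 loader hands you for `1.20`
except KeyError:
    print("lookup failed loudly")

# In JavaScript, the equivalent lookup runtimes[1.2] would coerce the
# number to the string "1.2" and fail only later, if at all.
```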
That looks like an annoying problem. Really need a decimal type for that.
As an aside, has anyone tried making a Python object notation? For instance, you'd get tuples, sets, complex, hex, etc. along with dicts, lists, and all the other stuff JSON has. I know you can use literal eval, but it has some security issues JSON doesn't have.
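For what it's worth, the stdlib's ast.literal_eval already covers most of that wishlist: literals only, no names or calls (though pathologically nested input can still exhaust the parser, which may be the security concern alluded to):

```python
import ast

# literal_eval accepts Python literal syntax: tuples, sets, complex
# numbers, hex ints, dicts, lists, and so on. The config string below
# is a made-up example.
config = ast.literal_eval(
    "{'retries': 0x0A, 'hosts': ('a', 'b'), 'flags': {1, 2}, 'gain': 1+2j}"
)
print(config["retries"])  # 10
print(config["hosts"])    # ('a', 'b')

# Arbitrary expressions are rejected rather than executed:
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    print("rejected")
```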
Realistically, a decimal type with defined precision is just a coverup for the real problem which is that version numbers aren't decimal numbers. If "1.2" < "1.19" < "1.20" that's not a decimal number and treating or storing it like one is a problem. I've seen this time and time again in software projects where someone treats the version number as a decimal, which works great right up until they have more than 9 minor versions, someone adds a bugfix version or someone adds a beta/alpha suffix. Then a whole bunch of stuff breaks that was making bad assumptions.
I find YAML to suck having used it in the serverless framework and many of my terrible bugs came from it. However, json is better than xml I think as a compromise between the two.
Good points re: XML and its misuse as anything other than a markup language (it's in its name, after all). After using things like HAML and whatnot for a few years, I went back to plain HTML. I like it much better.
YAML, meh, I choose to use it in Hugo because that's what I'm used to and I'd rather not learn a new config language until I'm forced to. I prefer config to just be in the language I'm working in, though many people disagree for various reasons but, you know, like, whatever, man.
The main complaint about XML (in the config-file context) seems to be the difficulty of editing it "by hand".
If you think about it, that's not a problem of the format but of the editors people typically use. Simple command-line text editors use a line paradigm rather than a tree paradigm. That's not well suited, and it's not the fault of the format.
Given that trees are about as fundamental as it gets as a data structure, maybe what is really needed to end these format wars is not a new format but a new editor or editor plugin.
This. There is no use writing "X is better than Y" without specifying what metric(s) you're using. If it's about human legibility or editability, YAML is better than XML. On the other hand, I would never use YAML as a machine-to-machine data interchange format; for those, JSON or XML are superior.
I also had to laugh at this statement:
the YAML specification has all these features that nobody ever uses, because they’re really confusing, and hard, and you can include documents inside of other documents, with references and stuff
That's a pretty funky argument to use in favour of XML.
As I am fond of saying, there's a common misquotation that runs "YAML is easy for humans to read". The full quote is "YAML is easy for humans to read wrongly".
KISS XML is the best, so glad to see people like the author coming around to it. JSON is fine for JS (since JS isn't likely ever to be gone from our lives), but YAML, as the author said, never has a case that makes it worth dealing with. Yet YAML is everywhere. I even use it in my personal Spring Boot projects, it's become so ingrained.
> Also the YAML specification has all these features that nobody ever uses, because they’re really confusing, and hard, and you can include documents inside of other documents, with references and stuff
XML also has these?!? In fact I'd guess YAML was developed to mirror the feature set of XML while having nicer syntax for humans.
"Better" is a loaded word. Behind it there is a lot of criteria, priorities, ways to measure, subjetivity and more. Your context for that word may be different than mine.
For those wanting human readable and editable configuration files, NestedText might be a solution. It only supports strings, but conversions can be done during import anyway.
NestedText is already the way I use YAML: everything is interpreted as a string. I have some trust in my YAML parser not to mangle most strings. I could use NestedText, but users would be unfamiliar with it, and IIRC the only parsers are in Python. But then I could use StrictYAML too: https://github.com/crdoconnor/strictyaml
It’s not just user error – it’s a problem with the YAML format and culture. The YAML format allows most strings to be written without quotes. And YAML culture, as communicated through most YAML snippets in documentation and YAML files in the wild, is to take advantage of this by only adding quotes when necessary. That makes certain categories of user error much more likely. Someone might write `09` and confirm it’s a string, then change `9` to `7` and not notice that their string has turned into a number.
The JSON format avoids this problem because it requires that all strings be quoted. If you start with `["09", "08"]` and change one of the digits inside the quotes, the string will stay a string. In JSON, strings becoming not-strings – removal of quotes or addition of backslash escaping – is more obvious than in YAML.
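A small demonstration with Python's json module (the list of zero-padded strings mirrors the example above):

```python
import json

# Quoted strings survive edits; an unquoted leading-zero token is a
# syntax error in JSON rather than a silent type change.
print(json.loads('["09", "07"]'))  # ['09', '07'] -- still strings

try:
    json.loads('[09, 07]')
except json.JSONDecodeError:
    print("rejected: leading zeros are not valid JSON numbers")
```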
“I need to configure this server and the server needs to know if this value is true or false.”
No, that’s bad. Don’t do that. That’s not a good use for XML.
But on the other hand, if you need to mark something bold, then XML is a great choice.
I see both cases as being quite equivalent: whether the server needs to know that a value is true/false, or that it needs to display something as bold, feels like much the same thing, doesn't it?
Or does he mean that you cannot easily retrieve the value of an arbitrary XML element? Whereas to display a document you just process it sequentially and do not need to 'retrieve' arbitrary data. Is this it?
I think I'm with the author on this. The two scenarios are very different. Bold text is probably a span of text within a larger block of text: delimiting it with an opening and closing tag makes sense. A config value is a boolean switch.
This makes sense:
<element>Some text here. <bold>Some more text here.</bold> And yet further text here.</element>
Oh, I was only saying that the article isn't claiming that you cannot easily retrieve the value of an arbitrary XML element. It's simply putting together selected examples that make YAML look bad and convince us that we must swear by XML.
Well, life isn't black and white, but somewhat greyscale. From my experience, none is better and they don't share the same use-cases.
It's born out of being a markup language, but its true strength is as a data-exchange format that can be self-documenting and human-readable at the same time.
Yes, it has a tendency to make people's eyes bleed with how verbose it can be. But if you were to open an XML file without knowing its format, or what each entry is supposed to be... it is likely you will understand (or be able to suss out) the purpose of the data being exchanged.
Schemas are very powerful, well-documented, and can be used to validate that an XML document is well-formed. People who tend to hate XML and Schemas tend not to understand how or why this is needed.
Contrary to the article's one point about config: I believe one of XML's strength is as configuration documents. It's better suited to configuration that is meant to be shared between different projects or platforms. If you only need a configuration file for a few keys and values - you're probably going to be fine with anything else.
But if you need to dump a configuration file into a document to be read by another program or exchange, then you're better off using XML because that format is more tolerant and can be well-documented by using Schemas. You can version those schemas, and use XSLT to convert the configuration document into another format that is far more readable.
That said, XML is not a be-all and end-all format for everyone. I believe, more importantly than anything, in using the appropriate tool for the given task. XML is simply not the best tool for every task.
So every XML file I've met has this boilerplate line at the top, something like:
<?xml version="1.0" encoding="UTF-8"?>
But more often than not, it also has a url.
But so does every JSON file I get from Azure. The other day I exported a Power App, and each app parameter/variable has its own folder with an XML file and a JSON file, which each reference versioned schemas and all sorts of stuff, all to essentially just say "key: value".
It is known as the XML declaration. It is not specific to any particular organization or group but is a standard part of the XML format itself. This declaration serves several important purposes:
Version Information: It specifies the version of the XML standard being used. In this case, it's version 1.0, which is the most common version of XML.
Character Encoding: It specifies the character encoding being used in the document. In this case, it's UTF-8, which is a widely used encoding that can represent a vast range of characters from different languages and character sets.
Standards Compliance: It signals that the document adheres to the XML standard. This declaration helps parsers and software that process XML documents to interpret and handle the document correctly.
Interoperability: By including this declaration, creators of XML documents ensure that their documents can be correctly interpreted and processed by a wide range of XML tools, libraries, and parsers.
In essence, the XML declaration helps ensure that XML documents are self-describing and can be processed consistently by different software and systems. It's a crucial part of the XML standard and is included at the beginning of most XML documents to set the context for how the document should be handled.
XML is very much misunderstood and this article is not an exception.
XML is a notational tool. A notational tool is a tool for a human to write something by hand and then process with a computer. The important part here is “by hand.” E.g. I’m processing a corpus of someone’s letters and see a phrase like “on Monday I saw him the last time” and I know that I need to mark “on Monday,” “I,” and “him” with references to a date and people I have inferred from elsewhere, and the only way I can do this is by hand, there is no automation. Yet once I place those references, I can mechanically index the corpus, which is my goal.
But writing a configuration file is the same task. Here again I am doing a thing that in the general case has to be done by a human, but the goal is to process the result mechanically.
All other uses of XML are a misuse. Yes, you can use it for data interchange, but it is similar to programmatically calling a third-party tool via a command line. Possible, but involves much overhead and is way less convenient than using that tool via a library. If the data are not generally composed by hand, then they should not be in XML. (But I would argue they should not be in JSON or YAML either; we do this mostly because we have no suitable tools.)
As a notational tool XML is actually rather good. Yes, it is verbose. You know what is not verbose? A special language you create for your special case. It will outperform any generic notation out there. If you decide to do that, then you will have to write a parser, process the text and get an abstract syntax tree. But note that XML is an abstract syntax tree. It is the intermediate result you get if you decide to solve your case in a perfect way. So maybe you could start with XML, get the syntax tree right, write the code to process it, and then see if you still want a parser.
On XML verbosity: what if we compared not the overall length, but the number of syntactic symbols?
I’ve extracted syntactic symbols on the right and you can see XML is much quieter.
On XML not being mappable to objects and arrays: it is straightforward if you remember that the data are supposed to be a syntax tree. They are indeed far from the final representation in the same way an arithmetic expression is far from the code that will evaluate it.
"So if you want to write YAML, you can. But it’ll just take that YAML and turn it into JSON behind the scenes. Then they also have a specific Caddy language. So you can give it the Caddy language and then it turns that into JSON behind the scenes. And you can give it an NGINX config and it’ll turn that into JSON behind the scenes. If you have the cycles and time to spare, that’s probably the best solution for most people…"
The fact that JSON doesn't support comments is so annoying, and I always thought that Douglas Crockford's rationale for this basically made no sense ("They can be misused!" - like, so what, nearly anything can be misused. So without support for comments e.g. in package.json files I have to do even worse hacky workaround bullshit like "__some_field_comment": "this is my comment"). There is of course jsonc and JSON5 but the fact that it's not supported everywhere means 10 years later we still can't write comments in package.json (there is https://github.com/npm/npm/issues/4482 and about a million related issues).
We actually offer both for configuration files of our software - YAML can be very pleasant to write and read for small datasets or those without much depth, without the overhead of curly brackets and quotation marks on everything. The original problem these formats solve is that people just want a simple nestable key-value structure/format, and both deliver on it - unlike XML. (When do you use an attribute vs. a tag with a text element?)
I know the history of XML, but just to get back to my original premise: why does a tag (a kind of key in the mind of most readers) have multiple "kinds" of values/children in parallel, especially one set that is map-shaped (attributes) while the other is array-shaped (proper children)?
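The ambiguity is easy to illustrate: all three of these spellings are reasonable XML for the same key/value pair (a hypothetical `server` config):

```xml
<!-- 1. As an attribute -->
<server port="8080"/>

<!-- 2. As a child element with text content -->
<server>
  <port>8080</port>
</server>

<!-- 3. As a generic name/value child -->
<server>
  <setting name="port" value="8080"/>
</server>
```

The format itself expresses no preference, which is exactly the bikeshedding problem.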
Not quite unambiguous! It declares how to parse numbers, but of course the "reference implementation" of JSON (insofar as there is one) is Javascript, in which numbers are floats and hence e.g. are bounded in precision, so the spec actually disagrees with the reference implementation.
I don't think this is true. The original specification (ECMA 404) only covers the formal language, i.e. what are valid documents and not what their memory representation after deserialisation is. A parser that rejects a number because it has too many digits would not be conforming.
rfc8259 is more detailed and it mentions that most parsers will have limited precision and references ieee754 to explain what the expected precision after deserialisation should be for most parsers.
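The mismatch is easy to demonstrate. A sketch in Python, whose `json` module parses integers exactly, versus what a parser that maps every number to an IEEE-754 double (as JavaScript's `JSON.parse` does) would produce:

```python
import json

big = 9007199254740993  # 2**53 + 1: not exactly representable as a double

# Python's json module parses integers exactly, which ECMA-404 permits...
assert json.loads(str(big)) == big

# ...while a double-only parser silently rounds to the nearest
# representable value:
assert float(big) == 9007199254740992.0
```

So two conforming parsers can disagree about the same document, which is what the RFC's IEEE-754 caveat is getting at.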
It is, but then you just use JSON, and can ignore YAML's broken design, plus benefit from the mass of more readily available and faster parsers!
I recently started a new small software project, and, out of habit, reached for YAML as the configuration file format. This article reminded me of the ways that I do really dislike YAML, though it didn't mention what is to me a more annoying parsing oddity:
country: no
YAML will interpret that "no" as a boolean "false". Which... c'mon.
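A quick check with PyYAML (which implements YAML 1.1 scalar resolution) shows the coercion; quoting the value is the only way out:

```python
import yaml  # assumes PyYAML, which follows YAML 1.1 rules

print(yaml.safe_load("country: no"))    # {'country': False}
print(yaml.safe_load('country: "no"'))  # {'country': 'no'}
```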
So I started thinking... maybe I should use something else instead.
TOML? Nah, my config file requires around 4 levels of nesting, and nesting in TOML isn't great, at least when done the more idiomatic way.
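For what it's worth, the idiomatic dotted-header style does reach four levels; it just gets noisy because every leaf table repeats the full path (these keys are invented for illustration):

```toml
[service.http.tls.ciphers]
preferred = "TLS_AES_128_GCM_SHA256"

[service.http.tls.certificates]
path = "/etc/certs"
```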
JSON? No, I want comments, and I hate having to double-quote all the keys.
XML? No, way too verbose; the author of the article goes into more detail why XML is bad for a configuration language.
HOCON? I've used it in some Scala projects, but I'm a little worried it's not mainstream enough and users might be confused at the syntax (or annoyed that they need to learn a new format, even if it's simple).
CUE? I'd never heard of it before it was mentioned in this article. "Validate, define, and use dynamic and text-based data" -- um, that sounds scary. I don't want a language, I want static, declarative configuration.
Do. Not. Write. Yaml. By. Hand. It is a serialization format. That means it is meant to be written to by a program, not a human. You will screw it up, a number of ways, if you write it by hand. It is only designed to be human-readable, not writeable. It is not a configuration format, it is a data serialization format.
I feel like we need that in giant bold red letters at the top of yaml.org. Nobody gets it.
YAML is the configuration format that is most comfortable to write by hand, IMO. Plenty of big projects use it for that use case. YAML also has plenty of different, redundant ways to represent the same data, which it wouldn't if it were a serialization format, no?
Besides, JSON is already human-readable when well formatted. Just lacking in hand-typing ergonomics.
I don't get what makes YAML a serialization format. And if it was intended to be such, then it sucks even more than most people argue it to.
You should probably use a search engine to discover what serialization is, and what a data serialization format is. It's a useful computer science concept.
Ignorance is only bliss until it gets you in trouble.
And yet you haven't presented alternatives. It's easy to dunk on something, harder to be constructive, I guess.
You may disagree with my reasons for not liking the others I mentioned, but that's just your opinion, and when I'm writing my own software, my opinion holds more weight.
What’s great about xmlstarlet is you can do a quick and dirty ad-hoc transformation on the command line, and when that starts getting too elaborate, you can generate the equivalent XSLT and use it anywhere.
Whenever this kind of arguments come up, I am sad that RON (https://github.com/ron-rs/ron) is not better known. To me it feels like a cleaner and better JSON.
In any case, my limited experience with it has made me hate YAML. Generally speaking, I have come to dislike any language with significant whitespace other than Haskell.
> It’s just too error-prone, there’s too many things… You just have to always quote everything.
Sorry, but this seems like a very silly, petty argument. There are other reasons to not like today's pervasive use of YAML, but this is not a very compelling one imo.
The same is true in XML, practically speaking, there just isn't another option - everything must be quoted or wrapped in tags.
So just ... use quotes in YAML. You can force it through a linter [0] if the optionality is what you're hung up on.
It's an abuse of Ycombinator servers that I can't paste a Billion Laughs statement that just fills up your parser with NONONONONONONONO for the next sixtillion eternities.
Because NONONONONONO . .
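For reference, the payload in question is just recursive entity expansion. Truncated to three levels it's harmless; the canonical version continues the pattern up to lol9, at which point the root expands to roughly a billion copies:

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!-- ...continue the pattern through lol9... -->
]>
<lolz>&lol3;</lolz>
```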
OK seriously. Seriously. These people all chillin' here in the future and talking about how great XML is need to jump in my DeLorean back to 2002. Or maybe try using XML for a few years. Or decades.
* Schemas break XML *all the time*[1],
* "XML-aware" diff/merge[4],
* No Such Thing As Line Breaks[2],
* Sneaky proprietary entities[3],
* NAMESPACES,
* "1NF? Is that a sex thing?",
* Computability[4],
* FRICKIN CHARSETS,
* asemantic but pretends to have semantics,
* Hierarchy Fetishism
And so so so much more. The combined effect of this is that it reduces the volume of the tool ecosystem for a given XML spec. Don't believe me? Run the metrics on gitlab/github/npm/pip/DaSEA[5].
So the tool ecosystem is - very often - only as big as a singular project. It's one of the reasons there's so many XML editor vendors. In S1000D, it's pretty common to have a special vendor for each project.
XML completely nukes, by its essential nature, any possibility of using standard tools. There is an entire category of emergent technology - Lightweight Markup Languages - that were invented, by individuals, working for free, for no other reason than to get out of XML.
Using XML at scale is something that should never happen, for any reason, ever. It's this horrifying perfect storm of non-technical academics steering a crew of malcontents still angry at how GML went. Ol' Linus was a bit of a butthole, but he was right on the money when it came to XML. ALL of the problems YML has that are mentioned - they can ALL be found in XML, but they're magnified times fifty bazillion because of the inherent lack of support.
[1] Leading whitespace in attributes? HOW CHARMING. Yeah, that's in a schema, a very popular one. So each schema is its own language, and both DITA and S1000D allow for virtually any level of customization on top of that, and on top of THAT, in S1000D, you have to contend with each of the Issues. Seriously, it's a flashback to the pre-ATA100/JASC 1930s "shop manual" systems.
[2] No such thing as "normalized" when it comes to XML whitespace, which means no lines, no tabs, no spaces. Everything is elements. Oh, ha ha, unless there's dual-mode DTD/XSD validation . . which should REALLY have its own bullet. Do you realize, in any way, how incredibly radioactive external entities in a internet-facing parser are?
[3] REVBARS! Oh, and FRICKIN CGMs. Good luck processing those, because they were golden tickets handed out to ISO-favored software vendors.
[4] Infinite arbitrary nesting combined with whitespace agnostic means it's REALLY hard to make any sort of compute optimization unless you load the WHOLE thing into memory. An xml-aware git repository has performance several orders of magnitude worse than a normal one, and if used in quantity with goofball schemas, you can actually choke a Bitbucket CLI.
Man, I wish I didn't need DTDs. Unfortunately, the USAF TMCR says I do. Verbatim. TO-00-5-3.
Yeah, in retrospect Billion Laughs was a bit of a cheap shot. It is, however, hilarious. And no one ever put forward any sort of mitigation or fix, for decades[1]. Meanwhile, in the YAML dev world open issues . .
if (refDepth > maxRefCount && node.kind === Yaml.Kind.ANCHOR_REF) {
I don't really have a dog in the YAML fight - apart from Asciidoctor-pdf template files[1] - but the YML people are patching, and the XML people didn't, for a very long time.
Why is that? I'm going to go back to the basic notion of XML as the Everything for Everything, which was encouraged by its design pattern insistence on fake semantics. YML has, no doubt, a big ol' dose of the same sickness, but with a lot less overhead, and it makes maintenance easier.
Keep in mind, we're now debating "How YML is perhaps just as bad as XML"
[1] This has resulted in a lot of software and IETM files (even whole devices) getting pulled from USN vessels in theatre; there's more than a few vulnerabilities that ride on the SGML/DTD Billion Laughs. Bunch of other ancient file formats getting the same treatment, something we in the industry saw coming since 2007. Just a ticking bomb until you fight a peer.
Not a whole lot. And not just XSD, there's nothing either SGML or XML do that can't be done, fifty times faster, with fewer keystrokes, on standard - i.e., commodity, open - tooling, in Asciidoc (as it's deployed) or "Markdown" (with extensions).
USAF hasn't yet gotten nailed with the DTD attacks the way the USN was[0]. And that was a complete musterfluck. First they pulled all the handheld maintenance devices, then they basically mandated that all the stuff getting stuffed into entities could instead get shoved into a black box XML element stuffed full of Base64 or reference to an external binary or - hell - whatever you want. That's the current solution: the //multimedia element.
You'd think USAAF and USAA[1] would have learned something from this . .
[0] That's changing as we speak; DIA has a bunch of hardass new IT policies rolling out. God be praised.
[1] Although the USAA spec has more flex in it when it comes to geometry and other extremely specific rendering behaviors. It's much easier to optimize because it's not insistent that a frickin PDF parts catalog have draftsman-perfect line art.
Here's what the entities (specifically, CGM, the 800 lb gorilla of external entity references) do that can't be done in XML+SVG: ISO/IEC CGM:1999 line types (your dashed lines are exactly right); ISO/IEC CGM:1999 nurbs (so that the curves are just right). I have a bunch of counterarguments to these things and more, but the easiest one is : how much is a perfect dashed line worth? Is it worth twenty two million dollars? Because that's what it cost the Navy. That's assuming the PLAAF/PLAN doesn't hop inside your maintenance network off the east coast of Taiwan. Then you can buy your dashes at the reasonable cost of a few hundred dead sailors.
They need to swap out //multimedia for a standardized, text-based format yesterday, though. Either that or release an ISO profile for SVG, which honestly would be, like, a week's worth of work at most . . if you wanted to see it done, of course. Oh ISO Technical Steering, you and your loveable scamps made up almost entirely of stoneage software industry reps.
The problem with YAML as a configuration format can be boiled down to two things:
1) significant whitespace; that one is obviously subjective
2) encoding types in the document
When I’m reading the `maxItems` property from it, I know it’s an integer. Why do I need the document author to also tell me it’s an integer?
For statically typed languages, you have to tell it what type you’re expecting, so it can interpret the value at that point. `config.getInt("maxItems")`.
Even for dynamic language, most of the time you still want to validate the types up front to avoid an incorrect type blowing up in a random place in your code. So write a schema, use the schema to drive how the values are interpreted.
By replacing all the non-structural types with strings, you can have the clean, quoteless format they're after, but without any of the "Norwegian problem" issues.
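As a sketch of that idea with PyYAML (the loader name and wiring are mine, not a standard API): re-route every non-structural implicit tag to the string constructor, then coerce in the ingesting code:

```python
import yaml

class StringOnlyLoader(yaml.SafeLoader):
    """Resolve every scalar to a plain string; the ingesting code
    (or a schema) decides what is an int, bool, etc."""

# Re-route the implicit scalar tags to the string constructor.
for _tag in ("bool", "int", "float", "null", "timestamp"):
    StringOnlyLoader.add_constructor(
        "tag:yaml.org,2002:" + _tag,
        StringOnlyLoader.construct_yaml_str,
    )

doc = yaml.load("country: no\nmaxItems: 10", Loader=StringOnlyLoader)
print(doc)  # {'country': 'no', 'maxItems': '10'} -- no Norway problem
max_items = int(doc["maxItems"])  # the reader applies the type it expects
```

This is essentially what StrictYAML and NestedText (mentioned elsewhere in the thread) build in by design.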
> When I’m reading the `maxItems` property from it, I know it’s an integer. Why do I need the document author to also tell me it’s an integer?
Because this allows well-designed client libraries to detect conflicts between the document-author intent and programmer intent, enabling mismatches to fail as errors rather than being read other-than-as-intended.
> For statically typed languages, you have to tell it what type you’re expecting, so it can interpret the value at that point.
Yes and a client in such a language should fail (or have a mode in which it fails) if the value is not actually, as read in YAML semantics, a type compatible with what you are asking for.
> Even for dynamic language, most of the time you still want to validate the types up front to avoid an incorrect type blowing up in a random place in your code. So write a schema, use the schema to drive how the values are interpreted.
Schemas are for validation, not interpretation. If you use them for interpretation, then you get JavaScript-esque weak typing, and the errors that come with it (basically, magnifying the kind of YAML 1.1 problems that YAML 1.2 tamped down.)
> By replacing all the non-structural types with strings, you can have the clean, quoteless format they're after, but without any of the "Norwegian problem" issues.
What you propose is an explosion of Norway-problem-style potential for values to be interpreted other than as intended by the document author, not a mitigation.
It's a bit rich having the whole case hinge on the fact that YAML interprets numbers as numbers and not as strings. What happens when your Go version moves to 1.20.1...? You'll have to stringify it. In other words, version members should always be strings.
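The failure mode is easy to reproduce with PyYAML. Note how the trailing zero vanishes, while the three-part version is spared only because it no longer parses as a number:

```python
import yaml  # assumes PyYAML

print(yaml.safe_load("go: 1.20"))    # {'go': 1.2} -- float, trailing zero gone
print(yaml.safe_load('go: "1.20"'))  # {'go': '1.20'} -- quoted string survives
print(yaml.safe_load("go: 1.20.1"))  # {'go': '1.20.1'} -- not a number, so a string
```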
The problem is that document, serialization, and configuration formats are different use cases and need different languages.
We learnt that XML was a bit too verbose for serialization and moved to JSON. Now we need a good configuration format, especially for the advanced use cases, and YAML ain't it. The known ambiguities are a minor thing - the real problems are that it's not typed strongly enough and that it has significant whitespace. We need a language optimized for custom datatypes. That's why properly used XML is actually better here (it's better typed), but it's far from optimal. We could really use a different option.
I don't need to hear you out, xml *is* better than yaml. Yaml is just a random descriptor that can be ruined by a typo. xml has xslt and xsl, and what have you. It just happens to have a shape that went out of favor because of monstrosities like ESBs and Spring and people have PTSD. I can easily validate and transform my XML just by writing more XML which is awesome. This is why I don't use yaml if I can help it. If I want some descriptor JSON is still better since I can validate it with JSON Schema, or whatever validator I choose out of many. Yaml on most platforms is barely supported.
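For example, a few lines of JSON Schema (this schema is invented for illustration) pin down both structure and types, which is exactly what plain YAML won't do for you:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["maxItems"],
  "properties": {
    "maxItems": { "type": "integer", "minimum": 0 },
    "country": { "type": "string" }
  }
}
```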
Nothing like a good old type safe compiled language to cut down on the verbosity, copy paste usage, silly syntax errors, weird undocumented you just have to know the magical incantations, etc. Kotlin or similar languages are the way to go. Much safer, more compact, easier to cut down on the copy paste reuse (which is just miserable drudgery), easy to introduce some sane abstractions where that makes sense. You get auto completion. And if it compiles, it's likely to just work.
People keep on moving around the deck chairs on the proverbial Titanic when it comes to configuration languages. Substituting yaml for json or toml just moves the problems. And substituting those with XML just introduces other issues and only marginally improves things. Well formed xml is nice. But so is well formed json. Schemas help, if the urls don't 404 and you have tools that can actually do something with them. Which, as it turns out is mostly not a thing in practice. And without that, it's just repetitive bloat. XML with schemas becomes very hard to read quickly.
There's a reason, people started ignoring XML once json became popular: json does most of the essential stuff well enough that XML just isn't worth the effort. And if you have something where you'd actually need the complexity of XML, it's likely to be some really ugly bloated kind of thing where the last thing you'd want to do is edit it manually.
I've dealt with cloudformation in XML form at some point in my life. It sucks. Not just a little bit. It's an absolute piss poor format for a thing like that. Since such a thing was lacking at the time, we ended up actually building our own little tools to generate that xml. Hand editing it was just too painful. One mistake could corrupt your entire stack. And it takes ages to find out if you actually got it right. In Json form it's hardly any better. It's just one of those convoluted over-engineered things. Anyway, Json support for cloudformation was not there at the time and the difference is like asking whether you'd preferred to be shot or stabbed. It's going to suck either way.
Gonna date myself but, for ~80% of configuration needs good old INI never let me down. Named sections of name value pairs with string and number data types (maybe bools too. do not recall). et voilà!
Honestly if the config layout fits in ini, there's no reason not to use it.
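Python's standard library still ships exactly this, and note how type interpretation happens at the call site rather than in the format (section and key names here are invented):

```python
import configparser

cfg = configparser.ConfigParser()
cfg.read_string("""
[server]
host = example.com
port = 8080
debug = yes
""")

# Everything is a string until the reader asks for a type:
host = cfg.get("server", "host")           # 'example.com'
port = cfg.getint("server", "port")        # 8080
debug = cfg.getboolean("server", "debug")  # True -- 'yes' is opt-in here
```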
YAML is great for human-readable config, but there are footguns.
XML is terrible for human-readable config. JSON is not great for human-readable config. TOML is okay for human-readable config.
The problem is there's no clean way to abstract strings, bools, lists, objects, trees, etc. into a human-readable configuration syntax that does not have a footgun.
> The problem is there's no clean way to abstract strings, bools, lists, objects, trees, etc. into a human-readable configuration syntax that does not have a footgun.
My favorite take on this by far is that any types beyond string, list, and dict don't belong in the format at all, and should be left to the ingesting code. I started on the path thanks to StrictYAML and found a home with NestedText.
INI is great for some flatter types of configuration (and I still use it too), but once you need a little bit of nesting, or lists, INI starts to get cumbersome.
I recently developed a reporting feature to a system. The report had to be portable, both human and machine readable, and the human-part had to look nice to management people.
I considered generating 2 separate reports, in JSON for machines and in PDF for humans. The PDF part turned out to be difficult.
In the end I settled with XML. Its machine readable by nature and with XSLT it becomes human readable in the browser. The programming language provided XML encoder in the standard library which made the task very easy.
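The trick is a one-line processing instruction in the report (filenames here are hypothetical): browsers apply the referenced stylesheet and render HTML, while machine consumers ignore the instruction and read the elements directly:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="report.xsl"?>
<report generated="2023-06-01">
  <entry status="ok" duration="12ms"/>
</report>
```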
I don’t care which format is used for manual edits so long as it’s strict (no flimsy type coercions, possible to auto-format, possible to use a schema), and supports comments.
For wire/data formats I don't care at all. Perf/size and library support are the only factors there, but XML or JSON will do up to the point where I can use protobuf or something similar. I don't like its conventions/opinionated-ness though. The absence of a set of things is usually not the same as an empty set of things, for example.
I don't really know why YAML got so popular. It has tons of footguns, and it's being misused as a custom "programming language", making all the YAML soup unreadable.
It got popular because it uses significant whitespace, and a whole generation of developers who grew up on Python refused to touch anything that didn't.
It got popular because it's far easier to read and edit hierarchical and array data in YAML than it is in XML (or JSON).
Most of the time I don't need a format with support for some esoteric charset or namespaces or any of the other baggage that comes with XML, I just need a quick way to exchange regular, structured information, possibly with comments, in UTF-8, that is easy to read and write.
Show me a better format that doesn't make my eyes bleed from the angle brackets (or the curly braces from JSON) that is commonly supported across lots of programming environments and I'll happily switch to it.
Yaml does not have footguns. It has dumb humans who use it as a configuration format, when it is a data serialization format.
Using a car engine as a pizza oven would also result in problems if you tried to cook a pie with it. But having a toxic half cooked pizza doesn't mean the engine has footguns. The engine works fine if you use it to drive to a pizza place to pick up a pizza.
That person is using it wrong. They are writing the file by hand, violating the spec. Don't write yaml by hand. It is not designed to be used that way.
There's not much evidence in this article. It's always worked well for me, and being more readable saves a lot of time and helps me better visualize and comprehend what I'm working on. If there are more glitches like the version issue the writer ran into, I have never noticed them or had them significantly impact me. Maybe it's just that my infrastructure engineering use case hasn't run into them, but I need more evidence before I'll grab the pitchforks with you.
In these types of configurations, _everything_ should be a string and data types are parsed with an additional helper function. It was a mistake to have per-platform, per-implementation parsing.
> Now, I understand why people do YAML, but there are better choices.
JSON. XML is great, but a little too verbose and can be difficult to parse. JSON is simple, well structured, easily human readable (if a sane structure is used), agnostic to indentation.
I don't understand why HOCON (https://github.com/lightbend/config/blob/main/HOCON.md) isn't used more often (at least for configuration use cases). It's a superset of JSON, has comments, multiline strings, optional quotes, replacement syntax. We use it at many places, and it's as nice as it can get.
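A flavor of what that buys you (keys invented for illustration): comments, unquoted strings, and substitutions, while every valid JSON document remains valid HOCON:

```hocon
# HOCON: JSON superset with human-friendly additions
app {
  name = my-service                # unquoted strings
  data-dir = /var/lib/${app.name}  # substitution
  max-items = 10
}
```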
The cognitive load for yaml is infinitely lower than XML or other things that “work” better.
The example given about go 1.2 vs 1.20 is a bad one. Go should have gone with 1.02 instead of 1.2. Also, you can just convert that into a string.
The reason why yaml is great is because I’m tired of learning “better” solutions that only fix the tiny percentage of issues that yaml has, but has a much more tremendous learning curve. XML is overengineered to the point that it’s a confusing mess.
This is like saying apples are better than oranges.
One is a document markup language. The other is a data notation.
YAML is better understood as a family of notations at this point. For every gripe about the standard YAML implementation, there is a "safe" implementation that does not have that problem. You don't have the throw the readability baby out with the footgun bathwater.
OK, I heard him out and I didn't really hear much of an argument so much as handwaving about how YAML is bad. He's provided exactly one edge case where it didn't do what he expected because (I guess, if I understood the context) he used a number type where a string would have worked better. Which serialization format is immune to those issues?
Except he didn't use a number type, or at least didn't think he was: YAML allows you to write strings without quoting them, and has heuristics to decide if an unquoted thing is a string or something else.
For another example: YAML 1.1 would treat "yes" and "no" as boolean true and false. I've heard this called the "Norway Problem":
country: no
Whoops. I had never actually considered the issue in the article, and I wonder if I've ever made a similar mistake with numbers and didn't realize it.
Honestly I think I would be fine with YAML if strings were required to be quoted. Yes, that would make things a tiny bit more verbose, but it would remove a big footgun.
Yeah that probably would have been better. Can’t really change it now though. I think everyone who uses a lot of YAML is aware of this issue and would know exactly what happened immediately; it’s up there with Java string compares by reference in terms of “everyone using the tech knows this pitfall.”
I wrote a dns management system some time ago and used xml to describe the base data. I’m hard pressed to know of a better format where you can hand-edit, easily diff, have schema validation of data, and reliably transform the data into other formats (in this case bind, isc dhcp, and documentation with graphviz+html)
OTOH, I’ve seen xml schemas that would make a rattlesnake cry.
Hi I'm a singer.
My name is:
Paul my name is
My last name is:
McCartney my last name is
My children are:
My child is:
Her name is:
Heather her name is
Her age is:
15 her age is
...my child is
...my children are
...I am singer
Years ago when parsing XML in Java, to my surprise at the time, the parser by default would try to resolve external DTDs while parsing, ouch, what a way to let someone DDoS your system.
Unfortunately YAML was even worse in that regard, as it allowed arbitrary code execution as seen in recent CVEs...
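The PyYAML incarnation of this is well known: full `yaml.load` with an unsafe loader will happily construct arbitrary Python objects, which is why `safe_load` rejects language-specific tags instead of executing them:

```python
import yaml

payload = '!!python/object/apply:os.system ["echo pwned"]'

# yaml.load(payload, Loader=yaml.UnsafeLoader) would run the shell command.
# safe_load refuses the tag instead:
try:
    yaml.safe_load(payload)
    blocked = False
except yaml.YAMLError:
    blocked = True
print(blocked)  # True
```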
In terms of being able to extend the worlds most common markup language: it’s 2023 and you can make a <cat> element in your HTML and style it with cat{color:brown;}
We got there. We got the extensible markup tooling we needed. Gosh, it took a long time though.
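It really is that small now: an unknown element is just an unstyled inline node until CSS says otherwise:

```html
<style>
  cat { display: block; color: brown; }
</style>
<cat>Browsers parse unknown elements into the DOM and let CSS style them.</cat>
```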
> And I think the fact that now everybody is writing React… and they’re writing it with JSX… and JSX is basically just inline XML… I think that shows that there are cases where actually XML is pretty good
YAML and XML both get their reputation from the software that use them poorly. XML is a pretty decent markup format. YAML is a decent serialization format. Both are brittle when used as a human edited configuration file format.
So it really depends on your usecase. Do you need to be able to import several independently developed vocab and use them, possibly namespaced, in a single document... Seriously, go XML.
CUE does not integrate with XML yet because of these beastly features, in particular how to handle attributes on an object when it also has nested content. It's basically the same problem of how you would transform XML into yaml or json, though there are more options in CUE
I do regret somewhat the name calling. But if I were to publish an article with such incredible arrogance and pretension as to judge two of the major techs of the decade, and did so without applying either correctly, I would welcome some name calling. I think the intolerance to this is something of the neurotypical.
I like to set ChatGPT to aggressive, accusatory mode when I ask it technical questions. It's funnily angry for nothing, but helps you anyway (while despising you).
I think our industry's distraction with config file formats is due to trying to find a reasonable balance between human readable and machine readable. XML was great for machines, YAML is OK for humans, JSON/TOML are somewhere in the middle.
>Does this dude not understand what a data serialization format is, yet is trying to tell people how to design applications?
"Data serialization format" is irrelevant, both XML and YAML are designed for this.
It's just that XML was over-designed (with all the auxiliary specs), and YAML was designed badly from the start, even for mere configuration.
If someone thinks they refuted his points or preference because they "understand what a data serialization format is", they really don't understand XML/YAML and the domains in which they're actually used.
SGML might not have been designed for data serialization (it was designed for document authoring), but the XML standard committees and company representatives were heavily interested in generic data serialization.
And YAML was also designed for human readable editing and serialization of configuration among other data serialization needs. "Configuration files" is literally the first use case for it mentioned in the standard's homepage:
"Even though its potential is virtually boundless, YAML was specifically created to work well for common use cases such as: configuration files, log files, interprocess messaging, cross-language data sharing, object persistence and debugging of complex data structures".
But the point is moot. What each language was "designed for" is irrelevant to what it's predominantly used for throughout the industry.
People found a use for them, regardless of whether their designers foresaw it (they had), and for this use, which is what interests people, there are certain issues.
If the answer was as easy as "just use a language better designed for that use case and it will solve your issues", they would have done it already.
Some have their hands tied because vendors/projects/etc. they use enforce TOML or YAML or XML, so they have to use them too. Others find the alternatives worse for their use case, but still don't consider the one they use (TOML/YAML/XML/etc.) optimal.