I understand it's a holy-war topic, but what's so bad about XML? I'd say it's a v...

jacobolus · on July 5, 2015

XML is a very complex spec which is difficult to implement properly, a heavy format with high storage overhead, which is extremely expensive to parse or process, but also too verbose and finicky to be pleasant for human editing. It doesn’t have built-in standard support for most of the common data types you want in a structured document, so they are all stored as strings or sequences of tags, and then parsed out in an ad-hoc way by each tool built on top. Its namespace feature is ineffective and often a potential security vulnerability. Its separation between attributes and elements is handled arbitrarily by various XML-derived formats and tools, usually inconsistently within the same format. It has terrible support for big arrays of numeric or other binary data. Etc. Etc.

XML, like SGML, is plausibly reasonable when you have something like a word processor document or web page, but is wholly inappropriate for almost every other use.

Notice that despite its acute limitations, JSON ended up as the metaformat of choice for most Web APIs.

Mikhail_Edoshin · on July 5, 2015

I usually save web pages as XPS or PDF, so I can compare the lengths. The XML 1.0 spec is 56 pages; by contrast, the YAML 1.2 spec with similar formatting is 96 pages. And the XML spec describes both the serialization format and simple grammar-based validation for the resulting high-level language (DTD); YAML only describes serialization.

XML is relatively verbose, but this is by design and is clearly stated as design goal #10: "Terseness in XML markup is of minimal importance."

The grammar for XML serialization itself clearly has a 1-character lookahead structure, so the parser can be deterministic and thus work in linear time. The tools that process XML (e.g. XML Schema, XPath or XSLT) are based on tree automata and, in most cases, work in linear time as well. (Of course, one can end up with a slow XSLT, I run into them all the time, but one can end up with a slow regex too.)

XML Schema provides very good types and a way to define your own types. I admit this part is relatively complex, but I think it's inherent complexity. If you have a Schema-aware parser, you'll get all the usual types (numbers, dates) and more, plus a better (more powerful) formal description of the high-level language than DTD. (For example, a DTD ties each element name to a single content model, while Schema can give the same name different types depending on context.) And RELAX NG is even more powerful. This extra descriptive power doesn't increase the runtime complexity though, it's still linear time.
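To make the typing point concrete, here is a minimal sketch using `lxml`'s `etree.XMLSchema` (this assumes `lxml` is installed; the `event`, `when` and `count` element names are made up for the example). It only shows the schema rejecting a badly typed value; `lxml.etree` itself still hands the text back as strings.

```python
from lxml import etree

# Hypothetical schema: <when> must be an xs:date, <count> an xs:integer.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="event">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="when" type="xs:date"/>
        <xs:element name="count" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = etree.XMLSchema(etree.fromstring(XSD))

good = etree.fromstring("<event><when>2015-07-05</when><count>3</count></event>")
bad = etree.fromstring("<event><when>July 5</when><count>3</count></event>")

good_ok = schema.validate(good)  # both values match their declared types
bad_ok = schema.validate(bad)    # "July 5" is not a lexically valid xs:date
```

After a failed `validate`, `schema.error_log` holds the details of what did not match.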

I don't know what you mean by namespaces being ineffective or vulnerable; I'd say it's as good as it gets for an extensible framework of roll-your-own languages without central authority.
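As a rough illustration (a sketch with Python's standard `xml.etree.ElementTree`; the `example.org` URIs and the element names are invented for the example), two vocabularies can both define `title` without clashing, because each name is qualified by a URI its author controls:

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing a "book" vocabulary with a "person" vocabulary.
DOC = """
<book xmlns="http://example.org/book" xmlns:p="http://example.org/person">
  <title>Dune</title>
  <p:author>
    <p:title>Mr.</p:title>
    <p:name>Frank Herbert</p:name>
  </p:author>
</book>
"""

root = ET.fromstring(DOC)

# Prefixes here are local to the query; only the URIs have to match the document.
NS = {"b": "http://example.org/book", "p": "http://example.org/person"}

book_title = root.find("b:title", NS).text          # the book's title
honorific = root.find("p:author/p:title", NS).text  # the person's title
```

The prefixes used in the query (`b`, `p`) need not be the ones used in the document; matching is done on the URI.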

The structure of a particular XML-based format (i.e. tag names, use of attributes, etc.) is the responsibility of the author of this format. Yes, some are very sloppy and illogical, but a lot of code is, regardless of the language.

I agree about huge arrays; XML was never meant to handle them. But modern tools perform very well on moderate and even large amounts of data; a few hundred megabytes is not a problem at all.
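For larger inputs the usual approach is to stream rather than build the whole tree. A minimal sketch with the standard library's `ET.iterparse`; the `item`/`amount` names and the helper `sum_amounts` are hypothetical:

```python
import io
import xml.etree.ElementTree as ET

def sum_amounts(source):
    """Add up the 'amount' attribute of every <item>, one element at a time."""
    total = 0
    # "end" fires once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            total += int(elem.get("amount"))
            # Drop the element's attributes and children once processed.
            # The emptied element stays attached to its parent, so very
            # large inputs may also want to prune the parent periodically.
            elem.clear()
    return total

# Small in-memory stand-in for a big file on disk.
data = io.BytesIO(
    b"<items>"
    + b"".join(b'<item amount="%d"/>' % i for i in range(1000))
    + b"</items>"
)
total = sum_amounts(data)
```

The same function accepts a file name or an open binary file in place of the `BytesIO` object.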

XML is not just plausibly reasonable for word processing or web documents, it's the only format designed to handle such (mixed) content.
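By mixed content I mean text and markup interleaved inside the same element. A small sketch of how that looks through Python's standard `xml.etree.ElementTree` (the sample sentence is just an illustration):

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<p>XML was <em>designed</em> for <b>mixed</b> content.</p>")

# Text before the first child lives on the parent's .text;
# text following each child lives on that child's .tail.
leading = para.text
pieces = [(child.tag, child.text, child.tail) for child in para]

# itertext() walks the text runs in document order, so the prose is recoverable.
plain = "".join(para.itertext())
```

Here `pieces` keeps the inline markup attached to the exact run of text it wraps, which is the part that key/value formats have no natural slot for.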

There is some shortage of tools; most state-of-the-art tools are now Java-based, and this doesn't work for everyone. But the biggest problem with XML is the amount of FUD and prejudice that accompanies nearly every mention of it.

EdiX · on July 5, 2015

>what's so bad about XML?

That no programming language deals natively with XML's data structure. That's why XPath and XSLT needed to be invented. This suggests that XML's data structure is not actually a good mapping for people's problems.

Mikhail_Edoshin · on July 5, 2015

I don't know about the whole landscape, but in Python, at least, with `lxml`, you can configure the parser to yield native Python objects. I.e. you parse an XML file and get your own objects as a result. Here the "your own" part is limited to your class and methods (no data, except what is in the element itself), but it's already rather convenient. (I can't say `lxml` is simple and Pythonic though; it's rather cumbersome to boot.)
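A minimal sketch of what I mean, assuming `lxml` is installed. It uses `etree.CustomElementClassLookup` and `set_element_class_lookup`; the `PersonElement` class, its `full_name` property and the `person` element are made up for the example:

```python
from lxml import etree

class PersonElement(etree.ElementBase):
    """Hypothetical class for <person>; it adds behaviour, not stored state."""

    @property
    def full_name(self):
        # Everything is read from the underlying XML element itself.
        return f"{self.get('first')} {self.get('last')}"

class Lookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if node_type == "element" and name == "person":
            return PersonElement
        return None  # anything else falls back to the default element class

parser = etree.XMLParser()
parser.set_element_class_lookup(Lookup())

root = etree.fromstring(
    '<people><person first="Ada" last="Lovelace"/></people>', parser
)
person = root[0]
name = person.full_name
```

The class only carries methods and properties because `lxml` creates these proxy objects on demand, so anything you want to persist has to live in the element's attributes, text or children.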