How to Avoid Being Called a Bozo When Producing XML (2005)

bane · on July 31, 2016

My "favorite" XML formats are the one that are just some kind of weird meta-format and don't really use any of the XML features:

   <format>
      <record id="1">
         <field name="id" value="1"/>
         <field name="name" value="abc">blah blah</field>
         <field name="attribute">this is the attribute value</field>
         <field name="end_of_record" value="True"/>
      </record>
      <record id="2">
      ...
      </record>
   </format>

And yes, these types of abominations are everywhere.

The only way to avoid being called a Bozo when producing XML is to either

a) ensure that humans never had to see this craziness

b) don't use XML

XML as a config file format, in particular, is probably one of the worst ideas in computing.

thom · on Aug 1, 2016

Here is an event from a popular sports data provider's XML format, for your delectation:

    <Event id="524717408" event_id="1" type_id="34" period_id="16" min="0" sec="0" team_id="20" outcome="1" x="0.0" y="0.0" timestamp="2014-11-30T12:29:59.446" last_modified="2014-11-30T13:24:03">
      <Q id="2045368832" qualifier_id="59" value="23, 2, 21, 12, 6, 17, 8, 4, 19, 11, 10, 1, 3, 5, 7, 24, 28, 33" />
      <Q id="1068483434" qualifier_id="227" value="0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0" />
      <Q id="840260679" qualifier_id="197" value="425" />
      <Q id="1586850783" qualifier_id="30" value="40383, 57328, 40146, 54756, 38580, 55605, 17339, 42774, 17784, 62399, 110979, 3673, 80447, 84395, 20452, 49596, 153366, 169359" />
      <Q id="340265857" qualifier_id="44" value="1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5" />
      <Q id="328261435" qualifier_id="194" value="38580" />
      <Q id="1426777221" qualifier_id="130" value="4" />
      <Q id="293008363" qualifier_id="131" value="1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0" />
    </Event>

XML and CSV, together at last.

chme · on Aug 1, 2016

Reminds me of the beautiful paths in svg:

     <path
       id="path4136"
       d="m 141.42136,428.08793 c 5.24568,-16.5136 15.9393,-31.24659 30.01423,-41.35166 14.07492,-10.10508 31.45369,-15.52663 48.77766,-15.21688 13.79473,0.24664 27.51957,4.08979 39.39595,11.11168 11.2946,6.67792 20.92213,16.25825 27.27185,27.74057 6.34973,11.48232 9.35256,24.85978 8.08349,37.91934 -0.97817,10.06598 -4.47673,19.87936 -10.10152,28.28427 -7.66405,11.4521 -18.89192,19.94346 -29.67188,28.52721 -10.77995,8.58374 -21.59293,17.81326 -27.90682,30.06164 -5.96111,11.56401 -7.38898,25.49638 -3.38484,37.87491 4.00414,12.37853 13.52214,22.96957 25.6082,27.78501 6.7156,2.67569 13.99861,3.58421 21.2132,4.04061 19.62989,1.24181 39.40632,-0.70279 58.58885,-5.05077 14.7604,-3.34565 29.1633,-8.0984 43.43656,-13.13198 20.00787,-7.05594 40.67497,-15.26376 54.54824,-31.31473 4.77196,-5.52102 8.57644,-11.80437 12.12183,-18.18274 19.76105,-35.55128 32.20013,-75.50916 33.33503,-116.16755 0.65168,-23.34676 -2.46779,-46.99293 -11.11168,-68.69037 -5.01987,-12.60061 -11.95904,-24.5794 -21.47642,-34.24345 -9.51738,-9.66404 -21.73721,-16.92383 -35.09212,-19.29463 -2.34449,-0.4162 -4.71259,-0.68237 -7.07107,-1.01016 -18.71745,-2.6014 -36.77133,-9.07291 -55.55839,-11.11167 -15.41222,-1.67253 -30.96773,-0.32564 -46.46701,0 -8.75302,0.1839 -17.50911,0.0413 -26.26397,0 -19.79058,-0.0933 -39.767,0.35345 -59.02055,4.93345 -19.25355,4.58001 -37.93007,13.61662 -51.08608,28.40158 -10.37036,11.65439 -16.79892,26.19465 -23.23351,40.4061 -3.97708,8.78379 -8.01783,17.53876 -12.12183,26.26397"
       style="opacity:1;fill:#000000;fill-opacity:1;stroke:none;stroke-width:2;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" />

brightball · on July 31, 2016

The biggest advantage of XML was the detailed schema validation. Having a uniform and flexible way to both generate data structures and ensure that their contents was valid before ever attempting to process them was handy.

XML had a lot of warts but most of its strengths are still seeking passable implementations in JSON. Protocol Buffers is probably the closest thing to being standard in that area for schemas and generation. The number of JavaScript templating options out there are trying to fill the XSLT gap.

zamalek · on Aug 1, 2016

It's also that it's extensible (primarily because of namespaces) - you can mix and match schemas so long as one of them uses xs:any. This brings up another way to avoid being called a bozo: namespace your XML. You're throwing a major advantage away if you don't, and if you don't need/understand that advantage then you're better off using a different serialization format.

bane · on Aug 1, 2016

Yeah, I acknowledge that bit.

But I also think the various schema definition languages (DFDs, XSD, whatever) turned out to be either not expressive enough (DFDs), or a complete PIA (XSD) and in the end, they weren't used very often.

Still, it's nice to have them when you need them and when they aren't there it hurts.

Mikhail_Edoshin · on Aug 1, 2016

Relax NG validation is very good; it's more expressive than XSD and it has both XML and non-XML forms and looks very nice.

brightball · on Aug 1, 2016

I'll definitely agree with that. I remember using some Java desktop GUI to generate XSD's rather than typing all that stuff out.

mwfunk · on July 31, 2016

I don't necessarily disagree, except for the last point. I've rarely (never?) encountered XML used as a config file format where users were expected or encouraged to edit that config file directly vs. using other tools or APIs to touch the file.

In those cases, I would rather have XML config files than undocumented binary blobs as config files. When I see an XML config file, I feel a little relief that it's not a binary blob rather than disappointment that it's not freeform text, because I assume that freeform text must've been off the table for whatever reason (which, depending on what the config file is for, can be a totally rational and reasonable thing to do).

I don't work in specialties where XML has a ton of visibility though- maybe there are lots of projects out there that I don't use in which people are required to hand-edit XML config files, as opposed to "it's in XML, so you could edit it directly, but really no one should be modifying the file with a text editor unless the preferred indirect mechanism isn't an option in some specific case".

SomeCallMeTim · on July 31, 2016

>In those cases, I would rather have XML config files than undocumented binary blobs as config files.

False dichotomy.

Better than XML and binary blobs:

* JSON (assuming everyone knows what this is)

* YAML [0]

* Lua tables (if you're already using Lua as a scripting language; Lua started out as a configuration language after all)

* INF format [1] (not my favorite, but pretty easy to parse and much better for humans to read than XML)

* Any of the above compressed with a gzip compatible compression (if size matters, though it rarely does these days)

Even Protocol Buffers [2] are better than XML, though at that point it becomes a "documented binary blob". But as long as the spec is shared, the format can easily be read by just about any programming language.

[0] https://en.wikipedia.org/wiki/YAML

[1] https://en.wikipedia.org/wiki/INF_file

[2] https://en.wikipedia.org/wiki/Protocol_Buffers

flukus · on July 31, 2016

> JSON (assuming everyone knows what this is)

The new .net uses json, it's awful. No comments allowed and it get's pretty unreadable when you have nested configuration elements.

ivl · on Aug 1, 2016

I seriously think the lack of comments is a deal breaker for JSON config files for me. At least with what I'm doing now. I find myself changing configs a ton, and I love being able to simply change which blocks are commented to get what I want, without having to dig anywhere.

SomeCallMeTim · on Aug 2, 2016

I agree...and I found a Gulp plugin that lets me pre-strip comments from my JSON files as part of the build process.

So I use JSON-with-comments, but the app only sees the stripped files.

piaste · on Aug 1, 2016

VSCode uses comments for every line in its settings.json file.

I guess they figured it may not be correct JSON, but since they aren't sending those particular JSON files anywhere it doesn't matter?

Perseids · on Aug 1, 2016

JSON5 [1] is an extension to JSON that allows comments, multi line strings, additional commas at the ends of lists, and more. It has become my preferred config file format.

[1] http://json5.org/

DiabloD3 · on Aug 1, 2016

And instead of Protobufs, Cap'n Proto [1], which was started by one of the principal author behind Protobufs, to fix all the flaws in Protobufs.

[1] https://capnproto.org/

int_handler · on Aug 1, 2016

You can use Protocol Buffers for configs without having to serialize them in its binary format. Protocol Buffers has always had its own text format [0] and now a JSON mapping [1] as well.

The proto text format is actually more flexible and less verbose than JSON since it does not require the outer enclosing set of braces and quotes around all the keys and has support for comments.

Here are a couple of examples of config files using the Protocol Buffers text format:

* Bazel CROSSTOOL: https://github.com/bazelbuild/bazel/blob/master/tools/cpp/CR...

* SyntaxNet: https://github.com/tensorflow/models/blob/master/syntaxnet/s...

[0] https://developers.google.com/protocol-buffers/docs/overview...

[1] https://developers.google.com/protocol-buffers/docs/proto3#j...

andrewflnr · on July 31, 2016

GP is obviously not stupid enough to think that XML and binary are the only options. Their whole point seemed to be that they've seen enough binary blobs in practice that even XML was a welcome step up.

chriswarbo · on Aug 1, 2016

I've always considered YAML to be far too complicated. There are many overlapping/redundant syntax rules for doing the same thing, lots of ways to mess up parsing, etc.

crdoconnor · on Aug 1, 2016

True, but if you turn those "features" off and swap out implicit typing for explicit typing it becomes a much simpler language.

This is what I ended up doing:

https://github.com/crdoconnor/strictyaml

moosingin3space · on Aug 1, 2016

I'd say TOML [0] is the best because it can be a very simple key=value structure, but also supports very detailed, nested structures. It has a 1-to-1 correspondence with JSON, but is more friendly for configuration (comments are a huge help!)

[0]: https://github.com/toml-lang/toml

chme · on Aug 1, 2016

I am missing a config file format / parser&generator lib that preserves every comment and format (empty lines, etc.) after a read/write cycle.

SomeCallMeTim · on Aug 2, 2016

I've written that for Lua files. And I've seen it for XML, to be fair.

bane · on July 31, 2016

I've mostly come across XML config files that are meant to be edited by humans in various programs that use some kind of Java framework as the back-end.

I don't Java much, so I'd be hard pressed to remember the various framework names (Spring maybe?) but I remember at one point writing a Python script to de-XML the config files into something that was just a bunch of key=values, then another Python script to convert it back to the required XML. IIR on that project, the handful of config options that needed to be tweaked were spread across a dozen or so different XML files.

If the framework could have just read a .txt file with key=values in it, config changes would have gone from 10 minutes to 30 seconds. I eventually just wrote a python thing that auto-deployed and configured the entire stack after asking you a couple questions.

It was absurd.

I believe Android development does (used to?) require lots of hand XML editing. Most of which just drives a Java code generator. I guess the tooling is better these days, but it was enough to drive me away.

WatchDog · on Aug 1, 2016

Spring was certainly a very XML focused framework.

However it has had for a long time ways of using property files in conjunction with XML, while you would still need the XML to define your dependencies, you could have a simple property file for runtime configuration.

Thankfully in newer releases and with spring-boot you can avoid XML entirely.

Mikhail_Edoshin · on Aug 1, 2016

Config files are usually plain key-value pairs and, of course, using a whole eXtensible Markup Language for them is kind of overkill. But if your config files are more complex, say, you need a Makefile-like stuff, then XML is more than appropriate.

int_handler · on Aug 1, 2016

There are plenty, especially in the Java and .NET worlds. To name a few:

* Ant/Ivy * Maven * MSBuild * NuGet package configs

YeGoblynQueenne · on Aug 1, 2016

>> I've rarely (never?) encountered XML used as a config file format where users were expected or encouraged to edit that config file directly

I think it's more like, it's a text format (no matter what the op recommends) so it can be edited. If you don't want anyone editing your configuration you don't store it in a text file, right?

Not to mention stuff like pom files that are explicitly meant to be edited by hand. Gods, why?

int_handler · on Aug 1, 2016

I'm sure you will also love JSONx: http://www.ibm.com/support/knowledgecenter/SS9H2Y_7.5.0/com....

bane · on Aug 1, 2016

There....that's a perfect example. Thank you.

jarman · on Aug 2, 2016

Well, if you have to store arbitrary, opaque JS object in XML document, it is much better way than JSON in CDATA

groovy2shoes · on Aug 1, 2016

You're going to "love" this little guy: http://txti.es/barry/xml

Only slightly better is the JSON counterpart: http://txti.es/barry/json

voltagex_ · on July 31, 2016

Microsoft pretty much standardised on XML all through the early .NET framework - app.config and web.config, plus most project files are XML files, and defining your own configuration (past simple key/values) is very tricky and error-prone.

voltagex_ · on Aug 1, 2016

How would you write that example while taking advantage of the XML features you're talking about?

bane · on Aug 1, 2016

I think what you might be asking is "what would more idiomatic XML look like?" And that's a fair question for people who haven't spent lots of time working with XML.

First off, I think attributes are evil. In theory, they're good, but nobody knows how to use them, so they shouldn't ever be part of your XML. They're simply elements with cardinality of 1.

The format would probably be better as:

   <format>
      <record>
         <id>1</id>
         <name>
            abc
            <comment>blah blah</comment>
         </name>
         <attribute>this is the attribute value</attribute>
      </record>
      <record>
      ...
      </record>
   </format>

This is a completely valid XML language, is much more clearer, less verbose, doesn't overload element names, doesn't abuse attributes, etc. etc.

One important thing that most people don't get about XML is that XML is a specification for describing data-interchange formats. XML isn't a format, or a language. The result of following the XML spec is an "XML format".

If somebody asks what format some data format is in, it's more appropriate to say "it's in an XML" rather than "it's in XML".

lmm · on Aug 1, 2016

Well this reduces to something like:

    <abc>
      blah blah
      <attribute>this is the attribute value</field>
    </abc>

The point is not to write a flexible meta-format for expressing arbitrary objects, because XML is already that. Each specific thing you want to express should have its own specific format. That way you can actually use the validation features too.

jroseattle · on July 31, 2016

> Don’t print

> Use an isolated serializer

Some old reference material (XML isn't as common as JSON anymore), but still worthwhile learning: don't output data formats directly. Directly = echo, print, printf,println...whatever your syntax suggests. I see this happen a lot with my junior engineers, and I have this same conversation with them.

Prefer to use data serializers that encapsulate all the syntactical rules that go along with XML, CSV, JSON, YAML, etc. Let the serializers do the grunt work of writing output in correct format.

Some serializers aren't always ideal - correctness and speed can be an issue. Nonetheless, prefer to use those mechanisms over writing your own output.

Freak_NL · on July 31, 2016

It's a classic trap that appears to get software developers caught (even now in 2016!) at that point where you are still lacking that firm grasp of the standard libraries available to you.

einhverfr · on July 31, 2016

Yep. As I usually say, it is better to be able to say "not my problem" than it is to be able to say "invented here."

The single greatest strength of a skilled software engineer is to know when to make it someone else's problem.

lttlrck · on July 31, 2016

Even if it is not your fault it can still be your problem.

einhverfr · on Aug 1, 2016

Which is why judgment in that question is so valuable.

nrser · on July 31, 2016

i think a major problem is that XML kinda looks and feels like HTML (and there was the whole XHTML thing to further confuse), and outputting HTML programmatically (vs string / print / template based) has most been frowned on as overweight and cumbersome.

you come from web dev doing HTML like that and you see XML and think "hey, that looks the same, i'll do it in the same way".

XML is a programmatic data exchange format like JSON or YAML, which most people would never think of outputting as templates or printed text, but it looks and feels like HTML, which most people deal with first and where that's the standard approach.

ams6110 · on July 31, 2016

>YAML, which most people would never think of outputting as templates

Don't tell the Ansible folks!

geerlingguy · on July 31, 2016

Ansible uses Jinja2 to output templates in whatever format is preferred by the thing being configured. I haven't personally seen Ansible used to output YAML... But people will do anything :-P

Ansible does use YAML as a configuration language though—something for which it's perfectly suited.

spdionis · on July 31, 2016

Well, some frameworks use yaml for config files and you might use ansible to write those.

That said the templating is usually trivial, just maybe write some string values.

cytzol · on Aug 1, 2016

I’ve done it. It’s painful enough that it teaches you “don’t do this!”. For example, you need to escape `{{ item }}` as `{{ "{{" }} item {{ "}}" }}`!

flomo · on July 31, 2016

Outputting JS as templates or printed text was pretty common before every language added a handy toJson method though.

nrser · on Aug 1, 2016

yeah, I guess I sorta missed that... I mean the 'X' in AJAX was for XML... I've def been guilty of outputting "XML" with php tags. by the time we got to JSON there were libs available, or maybe we just wrote our own.

cillian64 · on July 31, 2016

Surely it depends where your output is going? Print and friends are ideal for producing human-readable output, especially when it is temporary, for monitoring or debugging. And they are awful for producing stable machine-readable output which you might want to store.

If I'm trying to output straight to a user sitting in front of a terminal, they are going to be very unhappy if I output XML at them. And if my program only outputs machine-readable and requires another layer to turn it into something human-readable, that seems overcomplicated for most applications.

Have I missed the point, or is this advice intended for more specific scenarios than I imagined?

superuser2 · on July 31, 2016

I think the point is to use a serialization library when you are trying to output a structured format rather than writing a half-assed use-case-specific implementation of one.

Print and friends are appropriate when not attempting to produce data that conforms to any particular structured format.

0xcde4c3db · on July 31, 2016

I think the benefit of a serialization library is going to depend on how complex and dynamic your actual output is. I've done XML-by-printing, but in that case the XML elements were fixed scaffolding with no relation to our internal object hierarchy (A containing array of B containing array of C containing array of D, always, regardless of how our application changed). It was also on an embedded system for which adding libraries was kind of painful.

spdionis · on July 31, 2016

If I need to communicate with a couple of external endpoints that need 5-10 lines of mostly static xml amd the templating is simple i often might prefer using a static templated xml file.

It's much easier to understand what's happening later.

zubat · on Aug 1, 2016

I use XML for a combination of features that I consider very important but are also perceived as "overkill": A source syntax that has already handled text escaping and encoding, lets me add some abstract structure, and lets me encode the text in a way that lets me nest different parsing modes for various kinds of structured data.

The first two are easy enough to get with your pick of JSON or S-Expressions. For a lot of things even CSV is enough, although CSV has the downside of being so simple that people opt to write an incorrect toolchain for it themselves instead of adding a dependency.

But it's the last feature that really produces the complexity. Once you get into "I want the inner structure to contain a different and unambiguous semantic meaning from the outer structure" you have a pretty substantial engineering problem. Less structured approaches like JSON or S-Expr's drop the problem on the floor by declaring one universal semantic, making the programmer deal with adding anything else on top. XML's compromises to achieve a more detailed representation of data involve the angle bracket tax, schema languages, etc.

If you want a guarantee that a rich data source can be processed correctly through an n-tier architecture that emits various radically different outputs, these compromises become compelling. I'm a big fan of DocBook, for example, and its canonical toolchain is an XSLT style sheet: The workflow I end up with is initial writing in a light syntax of choice, compile to DocBook XML, add additional formatting and styling in the XML, and then emit the final document in whatever forms needed - HTML, PDF, etc. It's extremely flexible, and you wouldn't get the same quality of result with a less extensive treatment.

For ordinary data serialization problems and one-offs, it is considerably less interesting.

cptskippy · on July 31, 2016

XML is well regarded in the enterprise and languages like JAVA, C#, and VB.NET handle is spectacularly as an exchange format.

I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

I recall working with a partner who we were doing an identity federation with. Our system was using WS-Trust which is a SOAP/XML protocol. It wasn't ideal but everyone seemed to support it ok. These guys were cutting edge though and used Ruby on Rails.

No support for the protocol wasn't a huge deal, just means you have to craft your XML for your SOAP calls yourself. But at the time we were doing this, RoR didn't have SOAP or XML libraries. They had to write everything from the ground up. It sucked for me and I was just fielding rudimentary questions, I can't imagine how painful it must have been for them.

wtbob · on July 31, 2016

> I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

On the contrary, I think that XML's bad reputation comes from the fact that it is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb id="123">incredibly</adverb> <adjective>verbose</adjective>.

Also, the whole child/attribute dichotomy is a huge, huge mistake. I've been recently dealing with the XDG Menu Specification, and it contains a child/attribute design failure, one which would have been far less likely in a less-arcane format.

XML is not bad at making markup languages (and indeed, in those languages attributes make sense); it is poor at making data-transfer languages.

JSON has become popular because a lot of bad programmers saw nothing wrong with calling eval on untrusted input (before JSON.parse was available). It's still more verbose than a data transfer format should be, and people default to using unordered hashes instead of ordered key-value pairs, so it's not ideal.

The best human-readable data transfer format is probably canonical S-expressions; the best binary format would probably be ASN.1, were it not so incredibly arcane. As it is, maybe protobufs are a good binary compromise?

616c · on July 31, 2016

I think the worst of this is what I call semantic incoherence.

I have a system that has things like <Task ID="6">Blah</Task>. Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string?

Another favorite: <ExecuteCommand>[CDATA[Batchfile.bat]]</ExecuteCommand>, while a binary or something else will be <ExecuteCommand>"program.exe /argument:f /argument2:x"</ExecuteCommand>.

By the way, this is an enterprise as it gets: a software tool from a four-letter hardware company, quite huge, trying to sell off its software division. I wonder why.

XML is like all other "crap" tools: Java, PHP, SOAP: some people do not grok the spirit of the law, and they do weird things that reflect their discomfort and hurried need to operate with it. Many write it off.

I agree with your points, this is just my corrolary. The sad thing is SEXPR and XML are not far removed, one is arguably a subset of the other, and notice how people lose their shit when you ask them to consider Lisp languages for daily works because "all those parens are stupid" and how the culture surrounding a potentially viable tools makes people close up without delving in with curiosity.

https://en.wikipedia.org/wiki/SXML

http://arclanguage.org/item?id=19453

ams6110 · on July 31, 2016

> Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string

Becase XML is a text-based markup. If you truly want binary data you need to encode it and use CDATA sections.

616c · on July 31, 2016

That was not quite my point.

Why pretend it is a string at all?

I should have been more clear. Sometimes you have these argument type deals <Task ID="3"> where I would at least hope for <Task ID=3> or the monstrosity above (I assume ID=3 is not valid in hindsight, I am getting tired just writing this all on the second pass even!). And I see all different variations in the same XML file! There is no logical consistency, not even in the same config for the same function of this multi-stage system.

I am not even a novice programmer, and I find the variation annoying, and sometimes hard to reason about when I want to know what the hell the programer was thinking.

The valid part for the CDATA portion has changed several times in minor releases, so when our server team upgrades, I get to figure out the new syntax.

I thought XML was proposed to avoid these things! Haha. Again, tools in the hands of "wise men" like me are dangerous. I am probably as ignorant as them, I just think I know better!

oceanswave · on July 31, 2016

Enclosing the attribute within double quotes isn't pre-disposing the value to be of a particular type. It's part of the XML spec that attribute values are contained within double quotes, and must be to be valid. The type isn't implied in the file.

An xml schema such as <xs:element name="Task"> <xs:complexType> <xs:attribute name="ID" type="xs:int" use="required" /> </xs:complexType> </xs:element>

could more explicitly declare the type of the value.

616c · on July 31, 2016

Thanks for the explanation. I guess in this case I learned be careful what you wish for. I guess this is why I prefer the

But this is my ignorance of XML and familiarity with HTML showing.

ams6110 · on July 31, 2016

XML as a config file format was a disaster in every example I ever encountered. Config files are supposed to be editable by humans using editors, and most that I saw were too complex for that. In particular the NeXT/Apple property file formats are horrible abuses of XML.

As a format to represent structured data, it could be fine as long as you were pragmatic about it. In the case of <task id="3"> you either assumed that "id" was always an integer or you validated it with a schema declaration, which quickly got hairy.

In practice I never validated XML beyond it being well-formed (which was provided by default in any parser) and never had any real problems.

thangalin · on July 31, 2016

What takes fewer lines of code to parse?

    <element.name id.value="3.14">

Or accepting both:

    <element.name id.value="3.14">
    <element.name id.value=3.14>

How would you specify an empty value for mandatory attributes?

colejohnson66 · on July 31, 2016

I've seen empty values written as

    <tag attr1 attr2="val">data</tag>

Whether that's legal or not, I don't know.

oceanswave · on July 31, 2016

not valid. wondering if you've seen that within HTML, where it is valid.

colejohnson66 · on July 31, 2016

Actually, now that you mention it, I think it's from Chrome's Inspect Element tool, but I can't check right now.

I think if you wrote something like

    <div class="">...</div>

it would display in the tool as

    <div class>...</div>

kozak · on Aug 1, 2016

Chrome's Inspect Element shows you the non-serialized DOM structure, which means it's neither XML nor HTML at that point.

chiph · on July 31, 2016

Oh, this is the difference between attribute-oriented XML, element-oriented XML, and whatever-the-hell-we-feel-like-oriented XML. Publishers should pick one of the first two and be consistent about it.

ams6110 · on July 31, 2016

Agree. Practical/pragmatic use of XML as a data format requires consistency.

CamperBob2 · on July 31, 2016

I have a system that has things like <Task ID="6">Blah</Task>. Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string?

You're really asking a different question here: "Why should an integer be used as a task ID?" Storing the task ID as a string may give you options in the future that you wouldn't otherwise have, at a relatively small cost in parsing performance and validation overhead.

Most of the world's regrettable XML schemas were faulty at the specification stage, not the implementation stage. To minimize the likelihood of eventual regret, I usually prefer to store stuff in strings unless there's a very good reason not to. The fact that I'm using XML means that I'm not that concerned about performance, so... strings, it is.

A similar argument can be applied to the child/attribute dilemma. If there's even the slightest chance that a field isn't always going to be a leaf node, I'll do the extra typing and make it a child. Ideally the parser would be written to make them both work the same anyway.

616c · on Aug 1, 2016

I see were you were downvoted, but I happen to see merit with your comment. Again, a lot of people make technical decisions without stepping back and just scanning their choices as non-specialist (in the context of their programming domain) and ask hey, does this make sense?

CamperBob2 · on Aug 1, 2016

Technically all attributes are supposed to be surrounded by quotes regardless of how they're interpreted. That renders the premise of my whole comment invalid, to be "technically correct," so the people downvoting may have had that in mind.

Still, there are plenty of XML applications that leave out the quotes on numeric attributes. My point was really that they're not doing themselves any favors by abusing the spec that way. A text-based markup language is a great example of how premature optimization is unhelpful most of the time.

FunnyLookinHat · on July 31, 2016

> JSON has become popular because a lot of bad programmers saw nothing wrong with calling eval on untrusted input (before JSON.parse was available).

Disagree. JSON became popular because it was extremely easy to implement (both for marshaling and consuming), and because it was extremely lightweight.

I think you could also make the argument that JSON was conceptually easier for programmers to wrap their minds around. You could just pretty-print it and quickly get an idea for the object's format, attributes, etc.

seagreen · on July 31, 2016

I agree, especially with the easy to understand part.

Look how short the standard is: http://www.ecma-international.org/publications/files/ECMA-ST... It's small and perfect, like a 2x1 LEGO block.

Here's the XML spec: https://www.w3.org/TR/REC-xml/ <backs away slowly>

ams6110 · on July 31, 2016

XML could be fairly lightweight also. It was all the enterprisey-standard formats that were hideous.

E.g.

    {"name":"John","age":42}

vs.

    <person name="John" age="42" />

rimantas · on July 31, 2016

Now do the nested objects in both. One line does not show much.

Mikhail_Edoshin · on Aug 1, 2016

    <person id="123" name="John" age="42" sec:checksum="...">
      <family-member type="spouse" ref="456""/>
      <family-member type="child" ref="789" />
      <fin:credit-rating score="A"
          last-change="2016-02-04T12:34:56Z" />
      <уфмс:статус значение="42" />
    </person>

Here we can describe `person/@id` as element ID and `family-member/@ref` as a reference to an ID so our XML tools can link these together.

Also note three more elements from different namespaces: `@sec:checksum` could be some kind of technical information about the record, `fin:credit-rating` is added by the finanical module. The `@last-change` is defined as datetime so as we read it with other XML tools we'll get it as datetime type.

The next one is a tag in Russian language that describes something related to Russia; XML can use the whole Unicode in tag and attribute names.

Also, XML names are globally unique by design so there's no clash between all the different pieces and the tools can easily be configured to ignore parts they don't understand or work as a glue between different areas.

We can still efficiently validate the syntax the whole piece or parts of it as we see fit.

wtbob · on July 31, 2016

> Disagree. JSON became popular because it was extremely easy to implement (both for marshaling and consuming), and because it was extremely lightweight.

A canonical S-expression parser is strictly easier to implement, given that S-expressions consist only of lists and byte sequences (no numbers or objects), and is even more lightweight. JSON's big advantage was that it was familiar to a JavaScript programmer, that's all.

auganov · on July 31, 2016

S-expressions is basically no syntax. Human-readability depends solely on the person that comes up with the schema. I mean there's many reasons to love S-expressions but human-readability is an unusual one. edn [0] is an interesting compromise (as is clojure).

XML is actually IMO not that bad at human readability, it's pretty good. It's terrible at human writability. Conversely S-exps are lovely to work with.

[0] https://github.com/edn-format/edn

MichaelGG · on July 31, 2016

XML's bad rep for verbosity is almost entirely due to the nonsensical, terrible idea of requiring names in the end tag. Without that, it's about the same level of verbosity as JSON. And personally, after writing plenty of both by hand, XML is easier to get right. JSON, with it's poor quoting rules (mandatory quotes on names??) and lack of comments is very annoying to do by hand and seems visually more noisy.

erlehmann_ · on Aug 1, 2016

An advantage of names in end tags is human readability. Consider this XML fragment:

  <a>12<b>34<c>56<d>78<e>90</e></d></c></b></a>

Appending something to the end of the d element is easy, since one can just search for its end tag. In JSON and other formats that only have one single character at the end, one has to count brackets or parentheses for this purpose:

  (12(34(56(78(90)))))

mikeash · on Aug 1, 2016

If they're all <a> then you're back to square one.

JSON solves this with indentation, pretty printing, and using paired symbols that most conpetent editors can automatically balance. This solves the homogeneous case too.

Incidentally, XML can benefit from the first two, and many editors balance tags, so you can get the same thing there.

erlehmann_ · on Aug 1, 2016

It is rare in real-world XML that elements have children with the same type. Do you have a (non-divitis) example where the tags are all the same?

lmm · on Aug 1, 2016

It happens with any tree structure. E.g. I used to work on a system that managed reinsurance contracts and represented them as trees of contracts.

erlehmann_ · on Aug 1, 2016

Did the elements often have immediate child elements that had immediate child elements (and so on) of the same type? Like:

  <contract><contract><contract><contract> […]

lmm · on Aug 1, 2016

No, there were a couple of layers in that case. But that doesn't actually help you add a child at the correct level, because the end of a contract would look something like:

                ...
                </contract>
              </subcontracts>
            </content>          
          </contract>
        </subcontracts>
      </content>
    </contract>

JoshTriplett · on July 31, 2016

> I think that XML's bad reputation comes from the fact that it is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb id="123">incredibly</adverb> <adjective>verbose</adjective>.

> Also, the whole child/attribute dichotomy is a huge, huge mistake.

Those two factors run counter to each other. Attributes decrease verbosity, compared to child elements.

I agree, though. A few changes would make XML closer to ideal: eliminate attributes and eliminate the name in closing tags (<tagname>value</>), which makes child elements much less verbose, and reduces the need for attributes.

wtbob · on July 31, 2016

> A few changes would make XML closer to ideal: eliminate attributes and eliminate the name in closing tags (<tagname>value</>), which makes child elements much less verbose, and reduces the need for attributes.

Then just change '<tagname>' to '(tagname,' and '</>' to ')' and you'll have S-expressions.

Consider this:

    (feed
     (version 1)
     (title "Example Feed")
     (link http://example.org/)
     (updated "2003-12-13T18:30:02Z")
     (author (name "John Doe"))
     (id urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6)
    
     (entry
      (title "Atom-Powered Robots Run Amok")
      (link http://example.org/2003/12/13/atom03)
      (id urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a)
      (updated "2003-12-13T18:30:02Z")
      (summary "Some text.")))

That is a canonical S-expression (for a Scheme or Common Lisp reader, just quote the URIs too) version of:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
    
    <title>Example Feed</title>
    <link href="http://example.org/"/>
    <updated>2003-12-13T18:30:02Z</updated>
    <author>
    <name>John Doe</name>
    </author>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
    
    <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
    </entry>
    
    </feed>

I particularly like how URIs are sometimes encoded as attributes and sometimes as child text elements.

And compare to your proposed version:

    <feed>
    
    <title>Example Feed</>
    <link>http://example.org/</>
    <updated>2003-12-13T18:30:02Z</>
    <author>
    <name>John Doe</>
    </>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</>
    
    <entry>
    <title>Atom-Powered Robots Run Amok</>
    <link>http://example.org/2003/12/13/atom03</>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</>
    <updated>2003-12-13T18:30:02Z</>
    <summary>Some text.</>
    </>
    
    </>

I think it's pretty clear which is the most readable and elegant.

JoshTriplett · on July 31, 2016

If you're going to compare the two fairly, include appropriate indentation for both, not just the S-expression version. Also put the author and name tags on the same line, as you did with the S-expressions:

    <feed>
      <title>Example Feed</>
      <link>http://example.org/</>
      <updated>2003-12-13T18:30:02Z</>
      <author><name>John Doe</></>
      <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</>
    
      <entry>
        <title>Atom-Powered Robots Run Amok</>
        <link>http://example.org/2003/12/13/atom03</>
        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</>
        <updated>2003-12-13T18:30:02Z</>
        <summary>Some text.</>
      </>
    </>

That said, I like S-expressions too, and I wish more parsers and tools existed for them, such as schemas, query tools, and simple transformation tools.

wtbob · on July 31, 2016

> If you're going to compare the two fairly, include appropriate indentation for both, not just the S-expression version.

When I pasted it in from https://validator.w3.org/feed/docs/atom.html#sampleFeed I guess I lost the indents. No idea why: they are clearly there in the original.

j4_james · on Aug 1, 2016

> I particularly like how URIs are sometimes encoded as attributes and sometimes as child text elements.

I think the distinction here is that the one is an identifier which is not intended to be dereferencable, and the other is a link to a resource which has to be retrievable. In the good old days the id would most likely have been a URN and the link a URL, but that distinction was being discouraged in favour of the more general URI term at the time the Atom spec was developed. [1]

So while they're syntactically both URIs (well technically IRIs), they're functionally quite different. It may be debatable whether that's a good enough reason for the one to be an element value and the other an attribute value, but I don't think that decision was obviously wrong.

[1] https://tools.ietf.org/html/rfc3986#section-1.1.3

erlehmann_ · on Aug 1, 2016

The second and third examples do not have namespaces.

How would you include an HTML summary, for example?

wtbob · on Aug 1, 2016

> The second and third examples do not have namespaces.

> How would you include an HTML summary, for example?

As a text attribute, honestly — which would be necessary in XML as well (you could embed XHTML in XML, but not HTML). And in the general case, embedding one variant of XML inside another, rather than embedding a character-encoded variant of XML inside another, doesn't seem all that useful. How often do transforms need to reach all the way in like that?

I guess it's cool if it's possible, which is why I like S-expressions all the way down. But I don't think it's all that useful, as opposed to neat.

erlehmann_ · on Aug 1, 2016

> How often do transforms need to reach all the way in like that?

In my experience, almost every time XSLT is used on real-world documents, those are documents with multiple namespaces. XSLT stylesheets themselves are also documents that have multiple namespaces. Example: Atom feeds often contain XHTML content. It is a common problem with RSS that it does not specify if the content of an element is HTML or plain text.

I have found that arguments that doubt a feature is necassary from people who can not imagine use cases are almost invariably wrong, while arguments that doubt a feature is necessary from people who list use cases and why they think those are better solved otherwise or even left unsolved are often right. Your post seems like an example of the former; would you say that complex real-world content with namespaces could sway you in favor of them?

lmm · on Aug 1, 2016

I would be convinced if I saw real-world examples where having namespaces gave an advantage over not having namespaces. I can see the value in specifying whether the content of a given node is XHTML or text. I can at least theoretically see value in allowing nesting XHTML without a layer of escaping. I can't see any non-theoretical way in which namespaces are necessary to accomplish these things.

erlehmann_ · on Aug 1, 2016

Example: The XSLT stylesheet for this Atom feed generates a web page for each entry: http://news.dieweltistgarnichtso.net/notes/index.xml In this setup, the Atom XML for each entry is generated from XHTML with XSLT, which makes it possible to automatically include an Atom enclosure element for every XHTML media element. To publish a podcast episode, it is enough to add a post with an <audio> or <video> element, as an XSLT stylesheet can “reach into” the XHTML content.

Namespaces are also widely used in SVG, which uses the XLink specification for hyperlinks and can embed XHTML and MathML content. Since SVG can be embedded in (X)HTML, this means you can have an ATOM feed containing XHTML containing MathML and SVG that contains XHTML and all have it displayed correctly.

lmm · on Aug 1, 2016

> Example: The XSLT stylesheet for this Atom feed generates a web page for each entry: http://news.dieweltistgarnichtso.net/notes/index.xml

> In this setup, the Atom XML for each entry is generated from XHTML with XSLT, which makes it possible to automatically include an Atom enclosure element for every XHTML media element. To publish a podcast episode, it is enough to add a post with an <audio> or <video> element.

Sure. Why do you need namespaces to do that? Why couldn't you do it in XML-without-namespaces (or even JSON and some theoretical JSON-transformation-lanugage?)

> Namespaces are also widely used in SVG, which uses the XLink specification for hyperlinks and can embed XHTML and MathML content.

Again, why are namespaces necessary though? Why not just have a tag whose content is specified to be XHTML/MathML ? Wouldn't you want that anyway for the sake of human readability?

erlehmann_ · on Aug 1, 2016

XML without namespaces does not exist. If it existed, how would you differentiate between title and link elements in Atom and title and link elements in XHTML? They have the same element names, but do not have the same meaning and therefore must be processed differently. Namespaces ensure that any XML processor can know the language of each part of the input.

Namespaces actually are the general mechanism with which you can specify that content is in another language: If you look at the feed source code, you can see that XHTML content is started with <div xmlns="http://www.w3.org/1999/xhtml"> and ends where that div element is closed.

Having an element with the semantics that “this content is in another language” is done out of necessity in HTML, as it has no namespacing: <style> elements contain CSS, <script> elements contain JavaScript, <svg> elements contain SVG … having an element in each language to embed each other language would become complicated very fast.

lmm · on Aug 1, 2016

> XML without namespaces does not exist. If it existed, how would you differentiate between title and link elements in Atom and title and link elements in XHTML?

By where it is in the structure. The document is a tree where each element has well-defined context; there should never be confusion about whether a particular <title> is part of the feed or part of the content in the feed, because if it's in content it will be inside the content tag.

(Don't you need to do that anyway? I mean what if the XHTML had another Atom feed embedded in it? Or the content of one of the entries in the feed was another Atom feed? That's legitimate, but you wouldn't want to show titles from the "inner" feed as titles in the feed).

> Having an element with the semantics that “this content is in another language” is done out of necessity in HTML, as it has no namespacing: <style> elements contain CSS, <script> elements contain JavaScript, <svg> elements contain SVG … having an element in each language to embed each other language would become complicated very fast.

Only if you need the ability to embed an arbitrary other language. And if you do need that you can't possibly be validating or transforming based on what's embedded, so what value is the namespacing of it giving you?

Mikhail_Edoshin · on Aug 1, 2016

You may have incomplete documents (e.g. documents with conditional sections, very much like XSLT):

    <code:if test="...">
      <!-- whatever -->
    <code:else>
      <!-- whatever -->
    </code:if>

Here you'll first process your code part an copy the contents as they are and then process the contents; but in the source document the two languages are interspersed.

Or you may want to extend your text format with, say, literate programming and add code fragments and files. In my homegrown system it's like that:

    <literate:fragment id="..." language="...">
      <text:caption>...</text:caption>
      <literate:code>...</literate:code>
    </literate:fragment>

My text system already has a notion of captions so there's no need to add my own "literate:caption" here. Yet the other two "literate" elements are new an unique. Also, using a namespace here ensures that I'm sure not to have a clash if the base system adds their own "fragment" or "code" blocks.

lmm · on Aug 1, 2016

OK, I guess that takes things a level up. I don't like that kind of interspersed style and I don't think incomplete documents should be the same kind of thing as complete ones (e.g. one can't meaningfully validate your first example, because what if the "whatever" is an element that has to be present exactly once). But I can see that if you want to write things this way then namespaces help.

erlehmann_ · on Aug 2, 2016

“I don't like” seems to be an æsthetic argument, not a technical one.

erlehmann_ · on Aug 1, 2016

> The document is a tree where each element has well-defined context; there should never be confusion about whether a particular <title> is part of the feed or part of the content in the feed, because if it's in content it will be inside the content tag.

In this specific case, maybe – but generally, it is not true that you can infer the namespace of an element from context. Also, elements can have multiple attributes with different namespaces (and often do).

> I mean what if the XHTML had another Atom feed embedded in it? Or the content of one of the entries in the feed was another Atom feed? That's legitimate, but you wouldn't want to show titles from the "inner" feed as titles in the feed

That actually appears to be a bug in my stylesheet. Thank you for bringing it to my attention!

Programs often use namespaces to provide metadata. Here is an SVG I created with Inkscape that uses six different namespaces for metadata: http://daten.dieweltistgarnichtso.net/pics/icons/minetest/mi... Thanks to namespacing, web browsers can display the picture while ignoring Inkscape-specific data.

> Only if you need the ability to embed an arbitrary other language. And if you do need that you can't possibly be validating or transforming based on what's embedded, so what value is the namespacing of it giving you?

It is very useful to embed any arbitrary language, as XML processors can preserve the content they do not understand without processing it. My XSLT stylesheet would have no issue with SVG embedded in XHTML, just as your web browser most likely ignores everything about the SVG linked above it can not understand.

lmm · on Aug 1, 2016

> It is very useful to embed any arbitrary language, as XML processors can preserve the content they do not understand without processing it. My XSLT stylesheet would have no issue with SVG embedded in XHTML, just as your web browser most likely ignores everything about the SVG linked above it can not understand.

Sure, but you can ignore extra attributes in JSON or hypothetical XML-without-namespacing too. I feel like there's an excluded middle here: either the content of a given tag has to be, say, SVG, in which case the validation schema for the outer document could just say (in a structured way) "the content of this tag must be a valid SVG document according to the SVG schema", or the content is some opaque arbitrary XML document, in which case there's no meaningful validation to be done.

Even when working with something like XHTML-with-embedded-SVG, I found myself wishing there was a way to strip the namespaces, run my xpath queries / xslt transformations on the stripped version, and then put the namespaces back; I think I'd've got my actual business tasks done a lot quicker that way.

erlehmann_ · on Aug 2, 2016

Ignoring other attributes in data formats without namespaces is not as easy. What if one language is embedded in another and each one has a title element?

I do not know why you “feel” that way about the middle you want to exclude. It has been proven to be very useful in practice for me. Also without it, XML would not have the “extensible” property.

The way you describe working with “XHTML-with-embedded-SVG” reads to me like there is something about namespaces or your toolchain that you have difficulties with. I found that with XML-based systems, especially XSLT, it is easy to make a task needlessly complicated if one does not understand the details.

Mikhail_Edoshin · on Aug 1, 2016

The creators of XML were aware that it was verbose; they mention in their design goals that this was the least priority.

Child and attribute "dichotomy" is not a mistake. What you mean is that these two samples appear to be equivalent:

    <foo value="123" />
    <foo>123</foo>

But they are not equivalent. The first line (with an attribute) is there solely for the computer. When the document is rendered, the human user is not supposed to see anything there unless the computer adds it.

The second line (with text content) is there for both the computer and the human user. The text "123" is for the human user; the fact that this text is something called "foo" is for the computer. When the document is rendered, the human user will see "123" here. Maybe computer will enhance something or maybe it will just use it as index or reference, whatever.

Most people who don't like XML seem to only encounter it in config files. In config files there's normally no content that needs to be there for the end users, so all data can happily go into attributes. The text content starts to matter when we deal with natural language texts.

wtbob · on Aug 1, 2016

> The creators of XML were aware that it was verbose; they mention in their design goals that this was the least priority.

Which seems pretty wasteful.

> Child and attribute "dichotomy" is not a mistake.

It's not for a markup format — as I mentioned, it can make sense there — but, as you mentioned, it doesn't make sense in a config or data file format.

legulere · on July 31, 2016

The problem is that XML maps badly to data structures in common programming languages. JSON maps perfectly to structs and datastructures as lists/arrays/maps.

S-expressions are good if you work with Lisp like languages, but I don't think they're very readable if you're not into Lisp. I also can't see how they map easily into datastructures of imperative programming languages or even statically typed functional programming languages like haskell.

wtbob · on July 31, 2016

> S-expressions are good if you work with Lisp like languages, but I don't think they're very readable if you're not into Lisp.

Take a look at https://news.ycombinator.com/item?id=12198581; I think it demonstrates how readable one dialect of S-expressions can be.

> I also can't see how they map easily into datastructures of imperative programming languages

JSON consists of numbers, strings, booleans, objects and arrays; canonical S-expressions consist of bytes and lists. I contend that one can easily encode strings, numbers and booleans alike as bytes, and both objects and arrays as lists. Consider:

    {
        "id": 1234,
        "isEnabled": true,
        "props": ["abc", 123, false],
    }

This could be encoded in canonical S-expressions as:

    (object
     (id "1234")
     (is-enabled "true")
     (props (abc "123" "false")))

Granted, one still must convert the strings "1234," "true," "123," and "false" into the expected types, but with JSON one still must check the expected types anyway; it's not that big a difference.

And I honestly think that the S-expression version is far more attractive.

zyxley · on July 31, 2016

You could do make it more like S-expressions in JS if you really wanted.

    {object: [
      {id: "1234"},
      {isEnabled: "true"},
      {props: ["abc", "123", "false"]}]}

Not quite the same, but nothing keeps you from parsing an array of key/value pairs instead of a hash.

wtbob · on Aug 1, 2016

You may not leave JSON object properties unquoted, so it'd have to read:

    {"object": [
      {"id": "1234"},
      {"isEnabled": "true"},
      {"props": ["abc", "123", "false"]}]}

So you have extraneous quotes, extraneous semicolons, extraneous commas, plus the parsing code is complicated by having to handle all of that rather than atoms & lists (that's not a strong reason, since parsing code is written once and used millions of times).

I really, really don't get the visceral opposition to S-expressions. From my perspective they're both better & simpler.

PeterisP · on July 31, 2016

There is a very big difference - "with JSON one still must check the expected types anyway" is not really true, I can deserialize an arbitrary json and I will know the difference between 123 and "123" even if I don't know what's expected or, alternatively, mixed-type values are expected.

wtbob · on Aug 1, 2016

> There is a very big difference - "with JSON one still must check the expected types anyway" is not really true, I can deserialize an arbitrary json and I will know the difference between 123 and "123" even if I don't know what's expected or, alternatively, mixed-type values are expected.

You will still need, in your code, to handle both 123 & "123" (or handle one, and error on the other). That's really no different from, in your code, parsing "123" as an integer, or throwing an error.

In JSON one must check that every value is the type one expects, or throw an error. With canonical S-expressions, one must parse that every value is the type one expects, or throw an error. There's really no difference.

If one is willing to use a Scheme or Common Lisp reader, of course, then numbers &c. are natively supported, at the expense of more quoting of strings (unless one chooses to use symbols …).

lmm · on Aug 1, 2016

> You will still need, in your code, to handle both 123 & "123" (or handle one, and error on the other). That's really no different from, in your code, parsing "123" as an integer, or throwing an error.

It is different because in the latter case you have to write your own code to do it, while in the former your library will handle it for you.

> If one is willing to use a Scheme or Common Lisp reader, of course, then numbers &c. are natively supported, at the expense of more quoting of strings (unless one chooses to use symbols …).

So this format comes in dozens of partially-incompatible variants? Lovely.

ZenoArrow · on July 31, 2016

> "The best human-readable data transfer format is probably canonical S-expressions"

I personally think TOML is a bit more readable...

https://github.com/toml-lang/toml

legulere · on July 31, 2016

For configuration files, not for data serialisation.

ZenoArrow · on Aug 1, 2016

Let's put it like this... what can you express in JSON that you couldn't express in TOML?

lmm · on Aug 1, 2016

I can cleanly parse JSON, serialize it, and be confident I haven't lost anything. That can't be done for a language that allows comments without complicating the AST.

crdoconnor · on Aug 1, 2016

YAML is more readable than TOML though.

einhverfr · on July 31, 2016

XML is very often the least bad format (compared with ASN.1, JSON, X12 EDI, CSV, and other interchange formats), particularly when dealing with statically typed languages. XML is a horrid chimera of SGML but at least it is both human readable, subject to machine validation, and gets the job done.

cptskippy · on Aug 1, 2016

Oh EDI... you made me shudder.

einhverfr · on Aug 1, 2016

No experience with ASN.1?

cptskippy · on Aug 1, 2016

Nope. I ran into EDI when I was doing an integration for a JIT Hub that had to integrate with Hitachi and Seagate inventory systems. It was pretty awful to work with but the protocol was rock solid.

qwertyuiop924 · on July 31, 2016

Well, XML is complicated, so it's hard to build support for, and it's verbose, so it's heavy on the wire. Frankly, I think JSON is a better format in most contexts.

barrkel · on July 31, 2016

The biggest problem with XML is that it's a node labeled tree that makes the schema choice between leaf node and attribute for scalar data almost arbitrary, whereas JSON is an edge labeled tree without the same choice. Most programming languages use edge labeled graphs for in memory data structures, so the semantic distance is lower with JSON.

beagle3 · on July 31, 2016

Indeed. Furthermore, JSON readily differentiates between a single element {a: 'hello'} and a vector with one element {b: ['hello']}, as do most programming languages. XML does not, which leads to weird constructs like <Names><Name>a</Name></Names> to indicate that more than one name is possible. (Except .. if you actually use a schema with your XML parser, that indicates more than one is possible. But almost no one does). JSON also differentiates numbers from strings, etc.

As a result, in my experience, JSON tends to be more robust in real world use - even when a schema is available.

nialo · on July 31, 2016

Can you explain what you mean by JSON being an edge labeled tree in more detail? I don't understand and would really like to.

zachrose · on July 31, 2016

Taking a stab at this...

Let's say we have a dog who has four paws. In XML:

    <dog>
      <paw health="ok">
      <paw health="ok">
      <paw health="ok">
      <paw health="ok">
    </dog>

In JSON:

    { "paws": [
        { "health": "ok" },
        { "health": "ok" },
        { "health": "ok" },
        { "health": "ok" },
    ] }

I think what the GP is getting at is that JSON is always describing the relationships between a thing and another thing, rarely the things themselves. In the JSON version, for example, it can be assumed that an object in the "paws" array is a paw.

This example is sort of a straw man. The JSON version could be wrapped with { "dog": {...} } and the individual XML paws could be wrapped in a <paws> element. But in any case, JSON doesn't need you to give an explicitly label the type of each paw, just what they belong to and what's known about them.

qwertyuiop924 · on Aug 1, 2016

I'll step up to the plate to give a more technical answer.

JSON is an edge labelled tree, XML is node labelled tree. Let's see what that means, but first, let's talk about what nodes, edges, trees, and labels. You may already know, but I don't want to make no assumptions.

First, a tree: A tree is a datastructure with nodes, which reference other nodes, and each node is only referred to by one other node. Now, XML is obviously a tree, with each tag being a node:

       <dog>
       /| |\ 
      / | | \
     /  | |  \
  <paw> | | <paw>
        / \
       /   \
      /     \
    <paw>  <paw>

However JSON is also a tree: however, instead of tags, we have arrays and objects:

Well, actually, I'm not going to draw that. I'm typing on a phone, and it was hard enough making that last one. So, you know, just imagine it. And if you imagine hard enough, you just might notice that this graph is edge labelled, rather than node labelled.

A node, as you may recall, is just a thing on the tree, like a tag, or an object or an array. Queue the music!

  TO THE TUNE OF "NOUNS" FROM SCHOOLHOUSE ROCK:
  Oh any list through which you can go (like a array, a linked list, or an arraylist),
  And any structure that you can show (like a hashmap, or a struct),
  If they have pointers you can follow (from an object in a tree),
  You know they're nodes, you know they're nodes

Aaaanyway, an edge is the link between two nodes, and labels are just names.

JSON labels edges: "I want the first value in the array you got at key "foo" from the root object."

XML labels nodes: "I want the paragraph tag with the id of 'foo' inside the body tag inside the html tag."

You see, with the JSON, the nodes themselves didn't have labels, just the links between them: With XML, it was the opposite: There was no name for the links, instead there were names for the objects.

GGP reminds us that most programming languages do it the same way JSON does (when was the last time referred to the Foo object in the Bar object in the head Baz object when coding?), and so JSON maps better to the kind of datastructures we use most of the time.

MichaelMoser123 · on Aug 1, 2016

I don't quite see why you can't do the same with xml; maybe it needs some more typing, but it is expressing the same thing.

  <paws>
    <health status="ok"/>
    <health status="ok"/>
    <health status="ok"/>
    <health status="ok"/>
  </paws>

i thought that the main advantage of json was that it can be used as is (code as data) in javascript, but the problem here of course is that without a parser/validator one can inject tons of malicious code. If you are not on javascript then you can't do without a parser / in memory tree structure - and that's the same DOM model once again.

Json needs a bit less typing, now is that really such a significant difference? i think that adoption in matters of markup is more like a fashion - once people got the hang of it then it seems natural and goes without explanation.

i would say that there is one major difference - binary or text; as long as its text then it doesn't quite matter how you structure your markup; if you need your data to be of small size then you will have to compress it; however the parsing of a text tree will usually take more time than the serialization of a binary structure (by several factors);

Therefore you will use text markup where application speed is not very important, or where speed of development is more important than application performance, or you will use it for complicated configuration data (and your users will hate you because a name value format like ini files is easier to handle - well, mostly)

lmm · on Aug 1, 2016

> I don't quite see why you can't do the same with xml; maybe it needs some more typing, but it is expressing the same thing.

It's not idiomatic though - the dog's paws aren't "healths". The point is that in XML each tag is expected to have a label and be an entity in its own right, whereas in JSON you expect each field to be an attribute.

barrkel · on Aug 1, 2016

Graph theory terminology: nodes are connected by edges. When drawn, the edges are the lines, the nodes are the blobs that are connected by lines.

All trees are graphs. Not all graphs are trees; there could be cycles, or children with multiple parents in an arbitrary graph.

In JSON, the nodes are literals: numbers, strings, booleans, arrays, hashes (object constructors). The edges are hash keys (object field names in the constructors).

In XML, the nodes themselves have the names. The edges are implicit in the syntax via containment, and are unlabeled.

In programming languages, generally our values don't have names. Instead, our variables have names, and refer to values; variables can be assigned different values, but the name doesn't change. More physically, if the values are stored on the heap, variables are pointers to values on the heap, and fields of heap objects are further pointers to more values on the heap. Here, variables and fields are edges, and the values are the nodes. Looked at from a graph theory perspective, the in-memory model is an edge labeled graph.

qwertyuiop924 · on July 31, 2016

Indeed. Even translating to Lisp, which has closer datastructures than most languages, the XML is translated to an edge-labeled tree.

marktangotango · on July 31, 2016

Great explanation, I'd never heard it put that way before, thanks!

chrisseaton · on July 31, 2016

What do you think JAVA stands for? It's not an abbreviation. It's the name of an island and it's just 'Java'.

Cpoll · on July 31, 2016

It's more accurate to say that it's named for the coffee beans that come from said island.

trav4225 · on July 31, 2016

It's even more accurate to say that it's named for the coffee made from the coffee beans that come from said island. ;-)

djur · on July 31, 2016

Ruby has had an included XML library since before Rails was released. soap4r is older than Rails too. I wrote my share of clients for SOAP services back then. soap4r wasn't fun to use but it mostly worked. If the service was really simple (a single call and response, for instance) it was sometimes more expedient to put together the request yourself.

When Savon came out 6-7 years ago it was a huge relief. Luckily, by that point, I was seeing a lot less SOAP. But even with Savon, the experience was only lifted to "not awful", never to "wow, I'm glad they used SOAP, this is so easy."

saurik · on July 31, 2016

My experience with early Ruby XML parsers is that they were all "how hard can this be?" hacks someone did over a weekend by people who didn't really use XML or understand the ecosystem of specifiations and thereby barely worked and often didn't support fundamental things like namespaces correctly. It took away everything which made XML powerful and left you with something that was often even finicky.

flomo · on Aug 1, 2016

Yeah, I worked on a Rails 1.x app which integrated with an xml feed. IIRC, the options were a library that used regexes internally and had horrible performance (but might have been fine for rss feeds), and another library which wrapped a C library and used callbacks[1]. Definitely was a huge pain point and probably was a mistake for me to use the latest hep environment for that app, but for the rest of Rails it might have been worth it.

[1] I think it might have been http://www.yoshidam.net/Ruby.html#xmlparser

SomeCallMeTim · on July 31, 2016

>well regarded in the enterprise

I think this alone should be enough to cast doubt on it, based on my (albeit limited) interactions with "enterprise" software.

>I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

What, like JavaScript? I've had to read and write XML packets from a Node app to work with (surprise!) an enterprise app. I had probably 20 choices of libraries with varying levels of features, and the one I chose worked fine.

I was lucky, compared to some of the others on this page: The "RPC"-style XML commands and responses I had to parse and generate were all well standardized, so I just wrote a wrapper that extracted the completely opaque tree of XML into a flatter JavaScript object/hash that was really easy to deal with, and similarly made a wrapper that would trivially generate the monstrous XML required to send commands and responses back to the server. My JSON-equivalent objects were easier to manipulate (and would also have been easier to deal with in Java or, in this case, C#), equally rich in the information they carried, but could have been serialized with 1/3 the number of bytes per message. Totally a win-win-win.

What I don't understand is why anyone thought using XML that way was a good idea, and why it still is popular in the enterprise. Bad habits are hard to break, I guess.

clockwerx · on Aug 1, 2016

> What I don't understand is why anyone thought using XML that way was a good idea, and why it still is popular in the enterprise. Bad habits are hard to break, I guess.

Namespaces, which then gives you easy answers for Internationalisation (xml:lang), a subject-predicate-object data structure (RDF), which can lead on to logical meaning/modelling of data (RDFS/OWL), which then lets you look at harder questions like trust/provenance.

There's also schema validation (XSD), transformation (XSLT), which then provides you tools like XPath.

Most of that is on the front page for the technology: https://www.w3.org/standards/xml/

The real problem is not syntax, its communication between groups with differing experiences and interests - how do I know your messages mean the same thing as what my system expects?

If you prove to be malicious, do I have to write a strict validator before I trust your input?

If you want to ensure your messages are well formed before they are sent, do you also have to write a validator?

How do I know our validators are checking the same things?

If you want to send a large document oriented data structure, but I only care about a specific section relating to my interests; do I have to understand where to look and what all of the surrounding material is; or can I query for the relevant bits?

On the more complicated RDF side of things - if you want to share identifiers with me, how do we both avoid calling everything record id=1?

If we are both talking about the same thing but know different parts of the story, how can I recognize your information as describing the same thing I know about?

If we both know about the same Thing, and know certain logical facts about that Thing, can we check those facts actually make sense against shared rules?

If we both know about the same Thing, and can see a logical inconsistency in data, can we reason about which data to Trust and why?

Unfortunately, communicating properly is hard even with all of the tools to help.

We tend to opt towards subjecting systems to an ongoing fuzzing test because we don't value many of the above things - we tend to work in organisations with a short attention span focused on the now and a narrow set of interests. It just kind of works for the 80% of the time, so we move on.

Contrast that with something like a library or museum, and you see why ideas like Dublin Core really catch on there.

SomeCallMeTim · on Aug 1, 2016

Sounds great in theory. In practice it doesn't seem nearly as carefully implemented, and/or XML is used where it's actually not needed.

XML is designed to be a markup language. The fact that it has all of these other things bolted on doesn't actually make it a good generic data interchange format.

For things like RDF, maybe it's the best option we have, but that's not because XML is great, it's because XML was used in the only standardized option.

Looking at an example of xml:lang:

    <?xml version="1.0" encoding="utf-8" ?>
    <doc xml:lang="en">
     <list title="Titre en français" xml:lang="fr">
      <p>Texte en français.</p>
      <p xml:lang="fr-ca">Texte en québécquois.</p>
      <p xml:lang="en">Second text in English.</p>
     </list>
     <p>Text in English.</p>
    </doc>

...this is a nightmare. If I want to translate a document, the last thing I want to do is embed each translation inline like that. Almost certainly the best response is to "fork" the document at the highest level and include separate language versions of the document; otherwise, if you have 20 translations of the document, you need 20x the text in the document than any one reader will need.

Yes, XML gives you that particular hammer. But using XML results in a lot of sore thumbs.

Schema validation is nice to be sure. I'm using JSON Schema Validation myself [1] to verify incoming JSON, and I'm automatically generating those schemas from the TypeScript data structure specifications [2]. This is particularly good for a JavaScript language target, of course, but I find XML and XPath to be ugly or painfully slow in every language I've used it from, while JSON just has a better impedance match to data storage and interchange.

[1] http://json-schema.org/

[2] https://github.com/YousefED/typescript-json-schema

cptskippy · on Aug 1, 2016

> What, like JavaScript?

No, JavaScript is not what I'd consider an enterprise language. I'm talking about C++, C#, VB.NET, Java, and LotusScript. Enterprise languages and enterprise applications (e.g. Siebel) have no problem talking to each other via SOAP/XML and they all produce WSDLs that are easily consumed by one another.

When using something like C# or Java you can easily import a WSDL from another application and the toolchain will automatically generate all of objects defined and properly serialize/deserialize XML into those objects. There's no need to write parsers or use sockets/webclients to talk SOAP.

Newer backend languages and frameworks (e.g. RoR, NodeJS, etc) don't have these mature and robust toolchains for XML/SOAP.

int_handler · on Aug 1, 2016

My impression is that the reason why XML is so well-regarded in the enterprise is because these companies are not aware of better alternatives, such as Protocol Buffers [1]. The reason why XML has a bad reputation outside of the enterprise is because it is so incredibly verbose (both the language itself and the code used for working with it), and that all-in-all, it is a sub-optimal solution to a solved problem.

To illustrate: Protocol Buffers' wire format is much more compact. It removes the complexity of having to deal with XML parsers by providing classes generated from the message definition/schema. You can use it with GRPC to implement your service APIs. It is supported for many different languages, including Java and C#. It now even has a JSON mapping [2]. Overall, Protocol Buffers can do everything XML can do as both an exchange format and as a configuration language but better.

[1] https://developers.google.com/protocol-buffers/

[2] https://developers.google.com/protocol-buffers/docs/proto3#j...

HelloNurse · on Aug 1, 2016

Protocol Buffers are just one of many proprietary serialization libraries. Regardless of technical excellence, Protocol Buffers and competing libraries are automatically much less suitable for actual enterprise use than open standard serialization protocols with multiple interoperable implementations, such as ASN.1. And of course, XML is usually preferable to ASN.1 or the like because it is equally standardized but it has an ample choice of implementations, advanced tools and human readability and writability.

int_handler · on Aug 1, 2016

Protocol Buffers is not proprietary. It is open source under the BSD license. Here is the source code: github.com/google/protobuf. It is very much an open protocol, and anyone is free to write their own implementation of it. It is just a standards-based protocol.

If your organization values the existence of a standard over technical excellence, then there is no use in convincing you. Otherwise, in terms of ease of use, performance, tooling, and human readability and writability, Protocol Buffers is superior to XML-based protocols (since the API for converting between the binary and text formats is extremely simple to use).

As a fun fact, if you really wanted to use XML as a wire format, you could even write an XmlFormat ser/de for Protocol Buffers, similar to the JsonFormat that is already provided, but then it would defeat one of the main purposes of using Protocol Buffers in the first place because you would replace an extremely performant wire format with an extremely sub-optimal one.

chii · on Aug 1, 2016

Protocol buffer isn't proprietary. It's just not a standards based protocol. But it doesn't stop you from writing code against it, and you can easily interop with a third party who is using protocol buffers.

brightball · on Aug 1, 2016

Another part of it is that statically typed languages benefit much more from XML with a strictly defined schema like DTD or XSD because it makes it easier to generate the objects that you're going to have to map it into.

With a language like Ruby, PHP, etc that isn't strongly typed it's not nearly as big of a deal. Developers in those languages are used to assuming everything is a string and converting it to something useful without the need to premap every datatype.

That's probably the main reason that XML was so much more popular with the languages you mention compared to the parts of the ecosystem that didn't benefit from it's constructs much (if at all).

ambrop7 · on July 31, 2016

Some time back I needed to generate an XML file in a Java web application. I attempted to figure out how to do it "right". The only "special" requirement was that it is formatted in a readable way.

So I was figuring out the Java XML stuff (don't remember what that was exactly, probably standard). But at some point the timeout in my brain kicked in, and I just wrote a loop generating the XML by brute-force through PrintWriter or something. I even escaped strings right since some library I had available conveniently offered the escape method (Guava maybe?).

chiph · on July 31, 2016

Back in the early days of XML, Internet Explorer would insert "+" characters to fold nested sections of XML. And was the default program to open .xml files. Guess what showed up in the documents I got from an integration partner?

firewalkwithme · on July 31, 2016

It still does! and I get corrupted files like that mailed to weekly by integration partners. I may be wrong but I think FF also adds some crap to xml files when used as a viewer. I actually like xml, for some reason the structure of it makes alot of sense to me, while json is untidy and confusing

Slartie · on July 31, 2016

Guess what caused a serious outage of a system at a customer that I know, with an estimated impact on his bottom line in the seven-digit area? Yeah, right: naive copying of some XML out of IE into the configuration of said system. Including those '+' characters, which resulted in it not exactly being XML anymore.

kyllo · on July 31, 2016

I once got an XML file from an integration partner where the whole thing was XML escaped (all the tags looked like <node>value</node>) because they had embedded it within an outer "envelope" XML file. They saw nothing wrong with this and argued when I questioned it. I wonder how they were planning to express escape sequences within the inner XML document that was already escaped...

nitrogen · on July 31, 2016

It's ugly of course, but a parser should have no problem with &amp; or &lt;. It can go arbitrarily deep.

cptskippy · on July 31, 2016

I think if you select/copied what was displayed you'd get the plus and minus signs. If you saved that wouldn't happen.

Nuffinatall · on July 31, 2016

Compared to the problems when dealing with 'delimited text', XML is great.

Also it's flexible where you can specify properties as attributes or child nodes, depending on wildcard specifications.

So I have dealt with lots of edge-case XML situations, but the solutions are always straight forward. Also it helps to have a client vs. trying to parse out raw XML, which means programming and scripting sometimes relies on personal tool development. XML handles scope creep well.

DougN7 · on July 31, 2016

Handling scope creep is my favorite feature. With XML, it's easy to deserialize even if an expected element is not there, or if there is an extra one you're not expecting, at least that's been my experience. I haven't done much JSON but I'm not sure how that would work with it.

QuercusMax · on Aug 1, 2016

Pretty much any "real" serialization format should handle that situation fine. Protobuf, JSON, YAML, Thrift, heck, even Java serialization can handle that, provided you set a serialVersionUID.

beagle3 · on July 31, 2016

JSON deserialization would basically be the same. XML does not score here.

616c · on July 31, 2016

On the Cognicast there was an excellent tangent (all of them were good) in episode 106, where Michael Nygard bemoans with the fellow Cognitect Craig that, despite all the hate from the JSON generation, the failed promise of XML was the ability (again, that is part, not the whole) to have separated data and presentation with schemas, so you would not have to redesign endpoints all the damn time.

http://blog.cognitect.com/cognicast/106

This is just one view, and I am sure I will be mercilessly downvoted, as this is a gross simplification of that point, but it was one of many gems in that episode. I might finally review XSLT as this once again affirms things other devs told me when they said do not write off XML, in the complexity of it is something interesting.

ams6110 · on July 31, 2016

I loved XML and XSLT. And Internet Exporer, for all its faults, had great support for XSLT in the browser from version 5. It was quite easy to build "rich" single-page apps that get XML data from the server and build various user presentations by updating DOM with XSLT.

erlehmann_ · on Aug 1, 2016

The HTML for my blog is generated by applying an XSLT stylesheet to its feed.

You can see the stylesheet here: http://news.dieweltistgarnichtso.net/posts/atom2html.xsl

You can see the resulting web page here: http://news.dieweltistgarnichtso.net/posts/

sethev · on Aug 1, 2016

I thought of that same exchange when I read this post but remembered it more as a lament that JSON doesn't support namespaces - so JSON is always context dependent.

_greim_ · on July 31, 2016

This article in some ways describes the delta from HTML development to XML development. In the early/mid 2000s, XML was cargo-culted through the tech world on a massive scale; typically being adopted by web developers who proceeded to apply the same habits and tools for XML as they'd been using for HTML. Which of course resulted in many of the issues mentioned.

MichaelGG · on July 31, 2016

There's a popular piece of "newer" software that decided that XML rules were too difficult. So they URL encode all values. It also uses print style formatting for XML tag names, so if you manage to get a name value that has, say, a : in it, you'll get invalid tags. This is the default setup, in 2016, for a system that handles a lot of real-world telephone calls.

Even just a few years ago I've worked with companies that wrote their own "XML parser". They explained it was pretty easy but they had to "special case" for broken output in the real world. An example of this output? "<tag />".

HTML would have been far better off if it had the strictness of XML. Remove end tag names so you can't have invalid nesting. If browsers had refused to parse invalid docs from the start, invalid docs would not have been produced. (And like XML, they could provide decent error messages, so the difficulty would not be significantly raised.)

beagle3 · on July 31, 2016

I used to hate doing XML in Python - ElementTree was the nicest of them 10 years ago, but it still hurt.

But last year, I discovered xmltodict[0] and since then, I don't really care - it makes doing xml (both reading and writing) no more cumbersone than using dicts, while still supporting stuff like namespaces, CDATA and friends.

I still think XML is a horrible, misguided idea - from inception, but even more so in how it is used in practice - but I no longer feel any pain interfacing with it.

[0] https://github.com/martinblech/xmltodict

Mikhail_Edoshin · on Aug 1, 2016

Python has a very good lxml module for advanced XML processing. You can define your own classes for XML elements, so you can read an XML file and get your own classes for the underlying elements. They're somewhat limited, you can easily define methods, but the data is locked to what's in XML. You can also define your own XPath functions and XSLT extensions. Comes very handy sometimes.

The API is still rather awkward though.

zo1 · on July 31, 2016

I think a big problem with XML in most languages is the tooling around it. The libraries to parse/create it are not very pleasant to work with because of the immense complexity they have to deal with. If they only had to conform to a very small subset of all of XML's features and quirks, you'd have a very sane ecosystem.

legulere · on July 31, 2016

There's really no reason to use UTF-16 but compatibility with older software (which is usually broken when handling surrogate pairs). It's an atavism from times when all unicode codepoints fitted into 16 bit.

yuhong · on July 31, 2016

I think that one boils to basically back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people was basically software folks but thought that 16-bit was enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.

Agentlien · on July 31, 2016

This reminds me of an interesting experience I had with XML at a pervious job a few years ago.

We had bought a product from another company which was to be integrated into our own main product. Theirs was horribly ugly, looking like a cross between a 90's website and an infomercial, predominately in vivid shades of pink and purple. And it was really buggy. I soon noticed that all the content (many hundred pages with text, video and interactive content) was specified in a giant XML file and that the application itself simply interpreted this file and presented it to the user. We quickly decided that the best course of action was for me to reverse-engineer this XML file and write our own code to generate an integrated version of it, presented in a visual style more in line with the rest of our own product. This meant we could also solve some of their bugs on the way.

I still feel this was the only reasonable option and it did work out within our given time frame. However, I will never forget the horrors I saw in that one file. A few gems included:

- The file was most certainly handwritten with lots of tag mismatches and spell errors in tag names.

- One of the main sections was missing in their own standalone version because of a syntax error which caused their program to skip over the entire main branch of the syntax tree in which it occurred.

- Exercises where you had to order a list of items were defined as dragging items into hit boxes on a static bitmap image of the numbers 1-10 on a purple background. The same image was used regardless of how many items had to be ordered. The hit boxes didn't align with those numbers at all and often overlapped. In their implementation, Items were stuck right where you dropped them, rather than snapping to a fixed position by the right number.

- We wrote a few tools to identify images and videos which were either present on disk but never referenced or vice versa. This was often a case of spelling errors, slight variations in word connotation or files placed in the wrong folder. In these cases, their original program would bail out and skip that page.

- Indices of chapters were written as plain text rather than inferred. They did not match how things were laid out in the XML and where it happened to align it was sooner or later broken by sections which were commented out or failed to parse.

There were many more issues, but these give some insight into the exciting challenge of getting their data to work in a consistent and logical manner. After the XML file had been thoroughly massaged into submission and uniformity, of course.

jessaustin · on July 31, 2016

Please edit your post to eliminate the fixed-text:

- It will be easier to read.

- Reading won't require a lot of fiddly trackpadding.

- Maybe it would be nice if HN's simple markup system could handle the case in which the author wants a list of indented items, but it doesn't, and fixed-text is a poor substitute for that.

[EDIT:] Thanks!

Agentlien · on July 31, 2016

There. I agree, it looked horrible

rwmj · on July 31, 2016

This is by no means totally bulletproof, but these C macros around libxml2 let us write nested well-formed XML expressions as code:

Example usage: https://github.com/libguestfs/libguestfs/blob/master/src/lau...

Macro definitions: https://github.com/libguestfs/libguestfs/blob/master/src/lau...

oceanswave · on July 31, 2016

Totally, we took this a step further and created a subversion repository where xml documents describe classes. Each method is either inline, or is described by a xml element of a particular namespace that links to a subversion id and revision. ;)

witty_username · on July 31, 2016

Note: I believe this is a reference to http://thedailywtf.com/articles/the-inner-json-effect

maze-le · on Aug 1, 2016

Some XML dialects become very confusing if features are added as an afterthought without consideration of syntax and sematics. Microsofts Wordprocessing XML for example has caveats like w:permStart:

    <w:permStart w:id="0" w:edGrp="editor"/>
    (...)
    <w:permEnd w:id="0"/>

permStart and permEnd define regions where special permissions are required to edit a document. It is encoded in a complete anti-XML syntax, where different tags (and a common ID) represent the start and end of a region.

Mikhail_Edoshin · on Aug 1, 2016

Microsoft Wordprocessing XML is very quirky :) I think they use these markers because different areas can overlap and thus you cannot express this with a tree-like structure.

coding123 · on July 31, 2016

There are many flavors of XML and JSON out there now. I think for many developers JSON started to "look good" when the number of standards that started stacking up against XML (and XML-ish/SGML-ish/HTML-ish based formats) started to make people go insane. In the healthcare world we typically had to deal with a never ending set of "format standards" that kept integrating themselves together. I guess originally that may have been the beauty of XML... we started with XML RPC, moving on to SOAP 1.0, SOAP 1.1 introduced new ways to send headers. At some point however it just went crazy.. I think kinda when the enterprise-level people got their hands on things, they started porting all of their non-standard wack-job features into XML.

WS-Addressing - ok seems simple, but now your SOAP stack has to support async processing. WS-Trust - OK Let's add a simple feature that lets you put "some tokens" in the request and response for security, auditing, non-repudiation - good ideas sure. WS-Eventing - Let's add enterprise queuing to XML and soap and require stacks to support that, let the users of the stack figure out a way to connect that to the queues.

Anyway the list goes on, and you can read about it here: https://en.wikipedia.org/wiki/List_of_web_service_specificat...

Suffice it to say, but XML died because the developer now had to learn all of these, how they worked because one tiny industry body starts to adopt 1% of each, requiring implementors to learn 99% of all. It basically just made JSON attractive, a reset if you will.

XML won't go away. HTML will continue forever (it crosses a developer-designer "human line" that makes it kinda permanent) Developers adapt to future technologies a lot faster than designers and other's dabbling in HTML.

Now all this being said, you can see the list of standards piling up against JSON. There's really no critical mass ready replacement though, so JSON will be safe for quite a while longer. JSON will only be replaced in various "areas" like YAML for config, binary JSON-compatible representations for wire and/or storage.

I'm not biased against XML for data transfer, but if someone asked me to create a SOAP 1.1 service with WS-Trust, SAML tokens, etc... I'd also argue for a more industry accepted REST service with OAuth tokens, simply because it would be like introducing the Hummer all over again in age where Tesla's are everywhere. - everyone would hate us.

Marazan · on July 31, 2016

XML is a perfectly fine format that was (ab)used dreadfully by many, many people to such an extent that many people only have examples of completely dreadful XML as their reference.

So many XML-as-interpreted-programming-language monstrosities out there (I know I wrote one as I had the perfect problem domain to use LISP but didn't have the environment capability to use LISP but did have a Database XML field to store 'data' in so I did XML-as-S-Expression with a SAX based interpreter - it was surprisingly nice).

erlehmann_ · on Aug 1, 2016

Discussions about XML and JSON often remind me of this comment on HN: https://news.ycombinator.com/item?id=5702868

Partial quote:

> XML can certainly be shorter than JSON and often is, and repeated tags are the best showcase for it:

     <user id="abc">
        <phoneNo type="home">123456789</phoneNo>
        <phoneNo type="work">321654987</phoneNo>
    </user>

> This turns into this beautiful JSON:

     [
      "users": [
	{
	  "id": "abc",
	  "phoneNos": [
	    { "type": "home", "value": "123456789" }, 
	    { "type": "work", "value": "321654987" }
	  ]
	}
      ]
    ]

lmm · on Aug 1, 2016

Not a fair comparison since the JSON case includes the outer list as well. And whenever I've seen the equivalent of this in a real-world XML format it would use a <phoneNos> tag to group the phone numbers together.

erlehmann_ · on Aug 1, 2016

You probably have not looked too closely at real-world XML.

• Many XHTML and SVG elements can occur without dedicated wrapper elements.

• In Atom feeds, <author>, <category>, <contributor>, <link> elements can occur multiple times without a dedicated wrapper element.

• In XSPF playlists, <link>, <meta>, <extension>, <location>, <identifier> elements can occur multiple times without a dedicated wrapper element.

stesch · on July 31, 2016

Had to post this old article because I encountered some bozo code again. Reading more about some CMS and planning on using it for my blogs when I saw the code of the RSS feed. It was written by the lead developer of the CMS and used text templates.

oceanswave · on July 31, 2016

The way your comment comes across is a bit irritating. Not understanding the underlying codebase and classifying based on an attenuated knowledge of a topic promotes one to the 'bozo' status more quickly than not. Many systems use text-template based feeds, examples are Shopify, Salesforce, Wordpress, and more. Are these systems fundamentally broken purely because of this approach? Probably not. In your case, are the text templates escaping their values when outputting? Are they validating for correct XML once generated? Have more questions than having pre-defined answers.

stesch · on July 31, 2016

You mention typical PHP projects written by people who think they know better than the likes of Tim Bray.

PHP, the language that made short tags a configuration option because they wanted to mix program code with XML.

PHP, the language with a lot of different escape functions because they didn't get it right the first time.

oceanswave · on July 31, 2016

I also mentioned projects written in Ruby and Java, but that's ok. VB.Net also has XML Literals. Ha ha.

MichaelGG · on July 31, 2016

VB's XML literals are just shortcuts for creating the corresponding classes though, right? That's quite a bit different.

oceanswave · on July 31, 2016

Was being a bit sardonic in my comment due to where the discussion went, but yeah, XML Literals in VB.Net create XDocument instances and are just like string literals except:

* Enclosing quotes aren't required * Assumed to be multi-line so line continuation characters aren't required * Are validated for being well-formed XML by the compiler (and at design-time, if VS) * Can have embedded expressions

oceanswave · on July 31, 2016

The author of this post is a bozo, doing any (or not doing any) of the suggested things does not guarantee well formed XML. Disregarding whole sections of the XML spec, prescribing a certain way to generate xml are more harmful than not. Can text templates generate well formed xml, absolutely. Can tools generate non-well formed xml, absolutely.

wbkang · on July 31, 2016

> Making mistakes with them is extremely easy and taking all cases into account is hard.

He states why right there. He doesn't say anywhere whether templates can or cannot generate well-formed xml.

hsivonen · on July 31, 2016

(I'm the author of the article.)

Today, it's clear that text/html has won over application/xhtml+xml and JSON has won over XML for most (non-enterprise) non-document uses. But back around 2003..2009, there was no shortage of people who advocated in favor of XML and got it wrong when writing it by hand or when generating it with text-based templates.

Philip Taylor (not to be confused with Philip TAYLOR) was one of the regulars on the #whatwg IRC channel around 2007..2009. He had a hobby of trying to get XML advocates' systems to produce ill-formed output. He pretty much succeeded every time. IIRC, he even found a bug in Validator.nu's XML output, even though Validator.nu practices what I preach in the article.

The easy way was to supply user input that contained U+FFFE and watch the output blow up with the Yellow Screen of Death when U+FFFE was echoed as-is. Unless you have a templating system designed with the warts of XML in mind, this will happen. (A proper XML serializer has to scrub the characters that aren't allowed in XML, as seen in https://hg.mozilla.org/projects/htmlparser/file/dd08dec8acb7... .)

He even found a bug in Tim Bray's code that was written to make the point that it's possible to generate XML correctly... (https://lists.w3.org/Archives/Public/www-archive/2009Mar/006...)

Cozumel · on July 31, 2016

The sheer amount of sites that produce badly formed RSS feeds is staggering, the whole point of a feed is to make your content accessible to everyone, a bit like meta-tags. Why have it if you're not going to at least implement it properly?

nialo · on July 31, 2016

I recently wrote a first pass at an RSS feed parser for podcasts, but couldn't find examples of interestingly malformed podcast feeds to test against. Do you have examples of sites with badly formed RSS feeds?

0x0 · on July 31, 2016

I had no idea there's such a beast called "XML 1.1". That sounds fun!

foota · on July 31, 2016

This reminds me of when I was just starting out as a programmer. I was doing contract work and needed to write a php json endpoint. I had no idea what I was doing and hardcoded it all with print statements. Yikes.

benbristow · on July 31, 2016

Why would anyone choose to use XML over JSON, other than for RSS?

vbezhenar · on July 31, 2016

I can parse/print XML (using either in memory parser or streaming parser), use XML Schema to validate XML, XPath expressions to select necessary parts, automatic object mapping, and that's all with standard library without a single external dependency in Java. I don't know why would I use JSON over XML, unless I have very good reasons to do so.

For me the only thing that JSON got better is that JSON is directly mapped to commonly used data structures: arrays and maps.

moron4hire · on July 31, 2016

> For me the only thing that JSON got better is that JSON is directly mapped to commonly used data structures: arrays and maps.

This is nice, but it's also kind of a pain, as it makes you have to stop and think about which structured data elements it's capable of supporting and which you have to send your own metadata through the wire and then reconstruct on your own. For example: Dates. Which is a shame. If there is one data element I want the most help with serializing/deserializing, it's freaking Dates. All the other ones are super easy in comparison. There's just way to much subjective, dirty, human culture tied up in Dates.

The only thing that I think is objectively better in all cases about JSON over XML is the less verbose end-structure syntax. I think XML only has Attributes because Tags have this silly need to state their name both as they enter and exit the room.

    <Tag1 attr1="hello" attr2="world">
      <Tag2>how</Tag2>
      <Tag2>are</Tag2>
      <Tag2>you?</Tag2>
    </Tag1>

For simple, unnestable data elements, having the more efficient Attribute starts to look attractive.

If XML were more the form:

    <Tag1 attr1="hello" attr2="world">
      <Tag2>how</>
      <Tag2>are</>
      <Tag2>you?</>
    </>

It's actually only one more additional, required character to add than if the attribute value were specified as an element instead.

    <Tag1>
      <attr1>hello</>
      <attr2>world</>
      <Tag2>how</>
      <Tag2>are</>
      <Tag2>you?</>
    </>

Heck, why stop there? Do we really need to have quite all of those angle brackets now? How about we just get rid of all the ones we can assume:

    <Tag1
      <attr1 hello>
      <attr2 world>
      <Tag2 how>
      <Tag2 are>
      <Tag2 you?>
    >

And finally, who even likes angle brackets? I've never enjoyed the dual duty they play as delimiters in XML and operators in other languages. Let's use a common set delimiter, something like square brackets or maybe parentheses.

    (Tag1
      (attr1 hello)
      (attr2 world)
      (Tag2 how)
      (Tag2 are)
      (Tag2 you?))

Now where have I seen this before?

PS: JSON that is as nearly as equivalent as I can make it is not much less verbose than original XML, and requires some level of convention to make up for the differences:

    {Tag1: { attr1: "hello", attr2: "world", children: [
      {Tag2: "how"},
      {Tag2: "are"},
      {Tag2: "you?}]}

Though I'm sure in common practice it'd have a lot of the original metadata of the XML version thrown away:

    {attr1: "hello", attr2: "world", children: [
      "how",
      "are",
      "you?"]}

mbrock · on July 31, 2016

The XML tag style is much, much easier to work with when you're dealing with markup. And XML's purpose is to be an Extensible Markup Language. It's way more appropriate than JSON or S-expressions for that.

(Do you prefer to write HTML documents as S-expressions?)