How to Avoid Being Called a Bozo When Producing XML (2005) (hsivonen.fi)
107 points by stesch on July 31, 2016 | 244 comments



My "favorite" XML formats are the one that are just some kind of weird meta-format and don't really use any of the XML features:

   <format>
      <record id="1">
         <field name="id" value="1"/>
         <field name="name" value="abc">blah blah</field>
         <field name="attribute">this is the attribute value</field>
         <field name="end_of_record" value="True"/>
      </record>
      <record id="2">
      ...
      </record>
   </format>
And yes, these types of abominations are everywhere.

The only way to avoid being called a Bozo when producing XML is to either

a) ensure that humans never have to see this craziness

b) don't use XML

XML as a config file format, in particular, is probably one of the worst ideas in computing.


Here is an event from a popular sports data provider's XML format, for your delectation:

    <Event id="524717408" event_id="1" type_id="34" period_id="16" min="0" sec="0" team_id="20" outcome="1" x="0.0" y="0.0" timestamp="2014-11-30T12:29:59.446" last_modified="2014-11-30T13:24:03">
      <Q id="2045368832" qualifier_id="59" value="23, 2, 21, 12, 6, 17, 8, 4, 19, 11, 10, 1, 3, 5, 7, 24, 28, 33" />
      <Q id="1068483434" qualifier_id="227" value="0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0" />
      <Q id="840260679" qualifier_id="197" value="425" />
      <Q id="1586850783" qualifier_id="30" value="40383, 57328, 40146, 54756, 38580, 55605, 17339, 42774, 17784, 62399, 110979, 3673, 80447, 84395, 20452, 49596, 153366, 169359" />
      <Q id="340265857" qualifier_id="44" value="1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5" />
      <Q id="328261435" qualifier_id="194" value="38580" />
      <Q id="1426777221" qualifier_id="130" value="4" />
      <Q id="293008363" qualifier_id="131" value="1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0" />
    </Event>
XML and CSV, together at last.


Reminds me of the beautiful paths in svg:

     <path
       id="path4136"
       d="m 141.42136,428.08793 c 5.24568,-16.5136 15.9393,-31.24659 30.01423,-41.35166 14.07492,-10.10508 31.45369,-15.52663 48.77766,-15.21688 13.79473,0.24664 27.51957,4.08979 39.39595,11.11168 11.2946,6.67792 20.92213,16.25825 27.27185,27.74057 6.34973,11.48232 9.35256,24.85978 8.08349,37.91934 -0.97817,10.06598 -4.47673,19.87936 -10.10152,28.28427 -7.66405,11.4521 -18.89192,19.94346 -29.67188,28.52721 -10.77995,8.58374 -21.59293,17.81326 -27.90682,30.06164 -5.96111,11.56401 -7.38898,25.49638 -3.38484,37.87491 4.00414,12.37853 13.52214,22.96957 25.6082,27.78501 6.7156,2.67569 13.99861,3.58421 21.2132,4.04061 19.62989,1.24181 39.40632,-0.70279 58.58885,-5.05077 14.7604,-3.34565 29.1633,-8.0984 43.43656,-13.13198 20.00787,-7.05594 40.67497,-15.26376 54.54824,-31.31473 4.77196,-5.52102 8.57644,-11.80437 12.12183,-18.18274 19.76105,-35.55128 32.20013,-75.50916 33.33503,-116.16755 0.65168,-23.34676 -2.46779,-46.99293 -11.11168,-68.69037 -5.01987,-12.60061 -11.95904,-24.5794 -21.47642,-34.24345 -9.51738,-9.66404 -21.73721,-16.92383 -35.09212,-19.29463 -2.34449,-0.4162 -4.71259,-0.68237 -7.07107,-1.01016 -18.71745,-2.6014 -36.77133,-9.07291 -55.55839,-11.11167 -15.41222,-1.67253 -30.96773,-0.32564 -46.46701,0 -8.75302,0.1839 -17.50911,0.0413 -26.26397,0 -19.79058,-0.0933 -39.767,0.35345 -59.02055,4.93345 -19.25355,4.58001 -37.93007,13.61662 -51.08608,28.40158 -10.37036,11.65439 -16.79892,26.19465 -23.23351,40.4061 -3.97708,8.78379 -8.01783,17.53876 -12.12183,26.26397"
       style="opacity:1;fill:#000000;fill-opacity:1;stroke:none;stroke-width:2;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" />


The biggest advantage of XML was the detailed schema validation. Having a uniform and flexible way to both generate data structures and ensure that their contents were valid before ever attempting to process them was handy.

XML had a lot of warts, but most of its strengths are still seeking passable implementations in JSON. Protocol Buffers is probably the closest thing to a standard in that area for schemas and generation. The many JavaScript templating options out there are trying to fill the XSLT gap.
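For reference, a schema in that world is just a .proto file that the code generators consume - a minimal made-up example:

    // person.proto (hypothetical)
    syntax = "proto3";

    message Person {
      string name = 1;
      int32 id = 2;
      repeated string emails = 3;
    }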


It's also that it's extensible (primarily because of namespaces) - you can mix and match schemas so long as one of them uses xs:any. This brings up another way to avoid being called a bozo: namespace your XML. You're throwing a major advantage away if you don't, and if you don't need/understand that advantage then you're better off using a different serialization format.
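For example, an extension point like that can look roughly like this in XSD (the element name is made up):

    <xs:element name="extensions">
      <xs:complexType>
        <xs:sequence>
          <!-- elements from any other namespace may appear here -->
          <xs:any namespace="##other" processContents="lax"
                  minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>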


Yeah, I acknowledge that bit.

But I also think the various schema definition languages (DTDs, XSD, whatever) turned out to be either not expressive enough (DTDs) or a complete PIA (XSD), and in the end they weren't used very often.

Still, it's nice to have them when you need them and when they aren't there it hurts.


RELAX NG validation is very good; it's more expressive than XSD, it has both XML and non-XML forms, and it looks very nice.


I'll definitely agree with that. I remember using some Java desktop GUI to generate XSDs rather than typing all that stuff out.


I don't necessarily disagree, except for the last point. I've rarely (never?) encountered XML used as a config file format where users were expected or encouraged to edit that config file directly vs. using other tools or APIs to touch the file.

In those cases, I would rather have XML config files than undocumented binary blobs as config files. When I see an XML config file, I feel a little relief that it's not a binary blob rather than disappointment that it's not freeform text, because I assume that freeform text must've been off the table for whatever reason (which, depending on what the config file is for, can be a totally rational and reasonable thing to do).

I don't work in specialties where XML has a ton of visibility, though - maybe there are lots of projects out there that I don't use in which people are required to hand-edit XML config files, as opposed to "it's in XML, so you could edit it directly, but really no one should be modifying the file with a text editor unless the preferred indirect mechanism isn't an option in some specific case".


>In those cases, I would rather have XML config files than undocumented binary blobs as config files.

False dichotomy.

Better than XML and binary blobs:

* JSON (assuming everyone knows what this is)

* YAML [0]

* Lua tables (if you're already using Lua as a scripting language; Lua started out as a configuration language after all)

* INF format [1] (not my favorite, but pretty easy to parse and much better for humans to read than XML)

* Any of the above compressed with gzip-compatible compression (if size matters, though it rarely does these days)

Even Protocol Buffers [2] are better than XML, though at that point it becomes a "documented binary blob". But as long as the spec is shared, the format can easily be read by just about any programming language.

[0] https://en.wikipedia.org/wiki/YAML

[1] https://en.wikipedia.org/wiki/INF_file

[2] https://en.wikipedia.org/wiki/Protocol_Buffers


> JSON (assuming everyone knows what this is)

The new .NET uses JSON, and it's awful. No comments allowed, and it gets pretty unreadable when you have nested configuration elements.


I seriously think the lack of comments is a deal breaker for JSON config files for me. At least with what I'm doing now. I find myself changing configs a ton, and I love being able to simply change which blocks are commented to get what I want, without having to dig anywhere.


I agree...and I found a Gulp plugin that lets me pre-strip comments from my JSON files as part of the build process.

So I use JSON-with-comments, but the app only sees the stripped files.


VSCode uses comments for every line in its settings.json file.

I guess they figured it may not be correct JSON, but since they aren't sending those particular JSON files anywhere it doesn't matter?


JSON5 [1] is an extension to JSON that allows comments, multi-line strings, trailing commas at the ends of lists, and more. It has become my preferred config file format.

[1] http://json5.org/


And instead of Protobufs, Cap'n Proto [1], which was started by the principal author of Protobufs to fix all of its flaws.

[1] https://capnproto.org/


You can use Protocol Buffers for configs without having to serialize them in its binary format. Protocol Buffers has always had its own text format [0] and now a JSON mapping [1] as well.

The proto text format is actually more flexible and less verbose than JSON, since it does not require an outer enclosing set of braces or quotes around all the keys, and it supports comments.
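Something like this (the field names are made up) would be a valid text-format config for a matching .proto message:

    # server settings -- comments are fine in the proto text format
    name: "frontend"
    port: 8080
    backend {
      host: "10.0.0.5"
      timeout_ms: 500
    }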

Here are a couple of examples of config files using the Protocol Buffers text format:

* Bazel CROSSTOOL: https://github.com/bazelbuild/bazel/blob/master/tools/cpp/CR...

* SyntaxNet: https://github.com/tensorflow/models/blob/master/syntaxnet/s...

[0] https://developers.google.com/protocol-buffers/docs/overview...

[1] https://developers.google.com/protocol-buffers/docs/proto3#j...


GP is obviously not stupid enough to think that XML and binary are the only options. Their whole point seemed to be that they've seen enough binary blobs in practice that even XML was a welcome step up.


I've always considered YAML to be far too complicated. There are many overlapping/redundant syntax rules for doing the same thing, lots of ways to mess up parsing, etc.


True, but if you turn those "features" off and swap out implicit typing for explicit typing it becomes a much simpler language.

This is what I ended up doing:

https://github.com/crdoconnor/strictyaml


I'd say TOML [0] is the best because it can be a very simple key=value structure, but also supports very detailed, nested structures. It has a 1-to-1 correspondence with JSON, but is more friendly for configuration (comments are a huge help!)
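A small made-up example:

    # comments are allowed
    title = "My Service"
    [database]
    host = "localhost"
    port = 5432
    enabled = true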

[0]: https://github.com/toml-lang/toml


I am missing a config file format / parser & generator library that preserves every comment and all formatting (empty lines, etc.) across a read/write cycle.


I've written that for Lua files. And I've seen it for XML, to be fair.


I've mostly come across XML config files that are meant to be edited by humans in various programs that use some kind of Java framework as the back-end.

I don't Java much, so I'd be hard pressed to remember the various framework names (Spring maybe?), but I remember at one point writing a Python script to de-XML the config files into something that was just a bunch of key=values, then another Python script to convert it back to the required XML. IIRC, on that project the handful of config options that needed to be tweaked were spread across a dozen or so different XML files.

If the framework could have just read a .txt file with key=values in it, config changes would have gone from 10 minutes to 30 seconds. I eventually just wrote a python thing that auto-deployed and configured the entire stack after asking you a couple questions.

It was absurd.

I believe Android development does (used to?) require lots of hand XML editing. Most of which just drives a Java code generator. I guess the tooling is better these days, but it was enough to drive me away.


Spring was certainly a very XML focused framework.

However, it has long had ways of using property files in conjunction with XML: while you would still need the XML to define your dependencies, you could have a simple property file for runtime configuration.

Thankfully in newer releases and with spring-boot you can avoid XML entirely.


Config files are usually plain key-value pairs and, of course, using a whole eXtensible Markup Language for them is kind of overkill. But if your config files are more complex, say, you need Makefile-like stuff, then XML is more than appropriate.


There are plenty, especially in the Java and .NET worlds. To name a few:

* Ant/Ivy

* Maven

* MSBuild

* NuGet package configs


>> I've rarely (never?) encountered XML used as a config file format where users were expected or encouraged to edit that config file directly

I think it's more like, it's a text format (no matter what the op recommends) so it can be edited. If you don't want anyone editing your configuration you don't store it in a text file, right?

Not to mention stuff like pom files that are explicitly meant to be edited by hand. Gods, why?



There... that's a perfect example. Thank you.


Well, if you have to store an arbitrary, opaque JS object in an XML document, it is a much better way than JSON in CDATA.


You're going to "love" this little guy: http://txti.es/barry/xml

Only slightly better is the JSON counterpart: http://txti.es/barry/json


Microsoft pretty much standardised on XML all through the early .NET framework - app.config and web.config, plus most project files are XML files, and defining your own configuration (beyond simple key/values) is very tricky and error-prone.


How would you write that example while taking advantage of the XML features you're talking about?


I think what you might be asking is "what would more idiomatic XML look like?" And that's a fair question for people who haven't spent lots of time working with XML.

First off, I think attributes are evil. In theory they're good, but nobody knows how to use them, so they shouldn't ever be part of your XML. They're simply elements with a cardinality of 1.

The format would probably be better as:

   <format>
      <record>
         <id>1</id>
         <name>
            abc
            <comment>blah blah</comment>
         </name>
         <attribute>this is the attribute value</attribute>
      </record>
      <record>
      ...
      </record>
   </format>
This is a completely valid XML language; it's much clearer, less verbose, doesn't overload element names, doesn't abuse attributes, etc. etc.

One important thing that most people don't get about XML is that XML is a specification for describing data-interchange formats. XML isn't a format, or a language. The result of following the XML spec is an "XML format".

If somebody asks what format some data format is in, it's more appropriate to say "it's in an XML" rather than "it's in XML".


Well this reduces to something like:

    <abc>
      blah blah
      <attribute>this is the attribute value</attribute>
    </abc>
The point is not to write a flexible meta-format for expressing arbitrary objects, because XML is already that. Each specific thing you want to express should have its own specific format. That way you can actually use the validation features too.


> Don’t print

> Use an isolated serializer

Some old reference material (XML isn't as common as JSON anymore), but still worth learning: don't output data formats directly. Directly = echo, print, printf, println... whatever your syntax suggests. I see this happen a lot with my junior engineers, and I have this same conversation with them.

Prefer to use data serializers that encapsulate all the syntactical rules that go along with XML, CSV, JSON, YAML, etc. Let the serializers do the grunt work of writing output in the correct format.

Some serializers aren't always ideal - correctness and speed can be an issue. Nonetheless, prefer to use those mechanisms over writing your own output.
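To make the contrast concrete, here's a minimal Python sketch (the record dict and field names are made up):

    import json
    import xml.etree.ElementTree as ET

    record = {"name": 'He said "hi" & left', "id": 7}

    # hand-rolled printing: the embedded quote silently produces invalid JSON
    print('{"name": "%s", "id": %d}' % (record["name"], record["id"]))

    # a serializer handles quoting and escaping for you
    print(json.dumps(record))

    # same idea for XML: the library escapes the & in the text node
    elem = ET.Element("record", id=str(record["id"]))
    ET.SubElement(elem, "name").text = record["name"]
    print(ET.tostring(elem, encoding="unicode"))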


It's a classic trap that software developers still get caught in (even now, in 2016!) at that point where you're still lacking a firm grasp of the standard libraries available to you.


Yep. As I usually say, it is better to be able to say "not my problem" than it is to be able to say "invented here."

The single greatest strength of a skilled software engineer is to know when to make it someone else's problem.


Even if it is not your fault it can still be your problem.


Which is why judgment in that question is so valuable.


I think a major problem is that XML kinda looks and feels like HTML (and there was the whole XHTML thing to further confuse things), and outputting HTML programmatically (vs. string / print / template based) has mostly been frowned upon as overweight and cumbersome.

You come from web dev doing HTML like that and you see XML and think "hey, that looks the same, I'll do it in the same way".

XML is a programmatic data exchange format like JSON or YAML, which most people would never think of outputting as templates or printed text, but it looks and feels like HTML, which most people deal with first and where that's the standard approach.


>YAML, which most people would never think of outputting as templates

Don't tell the Ansible folks!


Ansible uses Jinja2 to output templates in whatever format is preferred by the thing being configured. I haven't personally seen Ansible used to output YAML... But people will do anything :-P

Ansible does use YAML as a configuration language though—something for which it's perfectly suited.


Well, some frameworks use YAML for config files and you might use Ansible to write those.

That said, the templating is usually trivial - maybe just writing some string values.


I’ve done it. It’s painful enough that it teaches you “don’t do this!”. For example, you need to escape `{{ item }}` as `{{ "{{" }} item {{ "}}" }}`!


Outputting JS as templates or printed text was pretty common before every language added a handy toJson method though.


yeah, I guess I sorta missed that... I mean the 'X' in AJAX was for XML... I've def been guilty of outputting "XML" with php tags. by the time we got to JSON there were libs available, or maybe we just wrote our own.


Surely it depends where your output is going? Print and friends are ideal for producing human-readable output, especially when it is temporary, for monitoring or debugging. And they are awful for producing stable machine-readable output which you might want to store.

If I'm trying to output straight to a user sitting in front of a terminal, they are going to be very unhappy if I output XML at them. And if my program only outputs machine-readable and requires another layer to turn it into something human-readable, that seems overcomplicated for most applications.

Have I missed the point, or is this advice intended for more specific scenarios than I imagined?


I think the point is to use a serialization library when you are trying to output a structured format rather than writing a half-assed use-case-specific implementation of one.

Print and friends are appropriate when not attempting to produce data that conforms to any particular structured format.


I think the benefit of a serialization library is going to depend on how complex and dynamic your actual output is. I've done XML-by-printing, but in that case the XML elements were fixed scaffolding with no relation to our internal object hierarchy (A containing array of B containing array of C containing array of D, always, regardless of how our application changed). It was also on an embedded system for which adding libraries was kind of painful.


If I need to communicate with a couple of external endpoints that need 5-10 lines of mostly static XML and the templating is simple, I often prefer using a static templated XML file.

It's much easier to understand what's happening later.


I use XML for a combination of features that I consider very important but that are also perceived as "overkill": a source syntax that has already handled text escaping and encoding, lets me add some abstract structure, and lets me encode the text in a way that lets me nest different parsing modes for various kinds of structured data.

The first two are easy enough to get with your pick of JSON or S-Expressions. For a lot of things even CSV is enough, although CSV has the downside of being so simple that people opt to write an incorrect toolchain for it themselves instead of adding a dependency.

But it's the last feature that really produces the complexity. Once you get into "I want the inner structure to contain a different and unambiguous semantic meaning from the outer structure" you have a pretty substantial engineering problem. Less structured approaches like JSON or S-Expr's drop the problem on the floor by declaring one universal semantic, making the programmer deal with adding anything else on top. XML's compromises to achieve a more detailed representation of data involve the angle bracket tax, schema languages, etc.

If you want a guarantee that a rich data source can be processed correctly through an n-tier architecture that emits various radically different outputs, these compromises become compelling. I'm a big fan of DocBook, for example, and its canonical toolchain is an XSLT style sheet: The workflow I end up with is initial writing in a light syntax of choice, compile to DocBook XML, add additional formatting and styling in the XML, and then emit the final document in whatever forms needed - HTML, PDF, etc. It's extremely flexible, and you wouldn't get the same quality of result with a less extensive treatment.

For ordinary data serialization problems and one-offs, it is considerably less interesting.


XML is well regarded in the enterprise, and languages like Java, C#, and VB.NET handle it spectacularly as an exchange format.

I think its bad reputation comes from anyone not using an enterprise language, because the support just isn't there.

I recall working with a partner who we were doing an identity federation with. Our system was using WS-Trust which is a SOAP/XML protocol. It wasn't ideal but everyone seemed to support it ok. These guys were cutting edge though and used Ruby on Rails.

Lack of support for the protocol wasn't a huge deal; it just means you have to craft the XML for your SOAP calls yourself. But at the time we were doing this, RoR didn't have SOAP or XML libraries. They had to write everything from the ground up. It sucked for me and I was just fielding rudimentary questions; I can't imagine how painful it must have been for them.


> I think its bad reputation comes from anyone not using an enterprise language, because the support just isn't there.

On the contrary, I think that XML's bad reputation comes from the fact that it is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb id="123">incredibly</adverb> <adjective>verbose</adjective>.

Also, the whole child/attribute dichotomy is a huge, huge mistake. I've been recently dealing with the XDG Menu Specification, and it contains a child/attribute design failure, one which would have been far less likely in a less-arcane format.

XML is not bad at making markup languages (and indeed, in those languages attributes make sense); it is poor at making data-transfer languages.

JSON has become popular because a lot of bad programmers saw nothing wrong with calling eval on untrusted input (before JSON.parse was available). It's still more verbose than a data transfer format should be, and people default to using unordered hashes instead of ordered key-value pairs, so it's not ideal.

The best human-readable data transfer format is probably canonical S-expressions; the best binary format would probably be ASN.1, were it not so incredibly arcane. As it is, maybe protobufs are a good binary compromise?


I think the worst of this is what I call semantic incoherence.

I have a system that has things like <Task ID="6">Blah</Task>. Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string?

Another favorite: <ExecuteCommand>[CDATA[Batchfile.bat]]</ExecuteCommand>, while a binary or something else will be <ExecuteCommand>"program.exe /argument:f /argument2:x"</ExecuteCommand>.

By the way, this is as enterprise as it gets: a software tool from a four-letter hardware company, quite huge, trying to sell off its software division. I wonder why.

XML is like all the other "crap" tools (Java, PHP, SOAP): some people do not grok the spirit of the law, and they do weird things that reflect their discomfort and hurried need to operate with it. Many write it off.

I agree with your points; this is just my corollary. The sad thing is that SEXPRs and XML are not far removed - one is arguably a subset of the other - and notice how people lose their shit when you ask them to consider Lisp languages for daily work because "all those parens are stupid", and how the culture surrounding potentially viable tools makes people close up without delving in with curiosity.

https://en.wikipedia.org/wiki/SXML

http://arclanguage.org/item?id=19453


> Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string

Because XML is a text-based markup. If you truly want binary data you need to encode it and use CDATA sections.
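For example, something like this (the element name and encoding attribute are made up); base64 output happens to be XML-safe anyway, so the CDATA wrapper is just belt and braces:

    <payload encoding="base64"><![CDATA[SGVsbG8sIHdvcmxkIQ==]]></payload>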


That was not quite my point.

Why pretend it is a string at all?

<Task> <ID>3</ID> </Task>

I should have been clearer. Sometimes you have these argument-type deals, <Task ID="3">, where I would at least hope for <Task ID=3>, or the monstrosity above (I assume ID=3 is not valid, in hindsight; I am getting tired just writing this all out on the second pass even!). And I see all different variations in the same XML file! There is no logical consistency, not even in the same config for the same function of this multi-stage system.

I am not even a novice programmer, and I find the variation annoying, and sometimes hard to reason about when I want to know what the hell the programmer was thinking.

The valid part for the CDATA portion has changed several times in minor releases, so when our server team upgrades, I get to figure out the new syntax.

I thought XML was proposed to avoid these things! Haha. Again, tools in the hands of "wise men" like me are dangerous. I am probably as ignorant as them, I just think I know better!


Enclosing the attribute within double quotes isn't pre-disposing the value to be of a particular type. It's part of the XML spec that attribute values are contained within double quotes, and must be to be valid. The type isn't implied in the file.

An XML schema such as

    <xs:element name="Task">
      <xs:complexType>
        <xs:attribute name="ID" type="xs:int" use="required" />
      </xs:complexType>
    </xs:element>

could more explicitly declare the type of the value.


Thanks for the explanation. I guess in this case I learned to be careful what you wish for. I guess this is why I prefer the

<Item> <Parameter><Data></Parameter> </Item>

But this is my ignorance of XML and familiarity with HTML showing.


XML as a config file format was a disaster in every example I ever encountered. Config files are supposed to be editable by humans using editors, and most that I saw were too complex for that. In particular the NeXT/Apple property file formats are horrible abuses of XML.

As a format to represent structured data, it could be fine as long as you were pragmatic about it. In the case of <task id="3"> you either assumed that "id" was always an integer or you validated it with a schema declaration, which quickly got hairy.

In practice I never validated XML beyond it being well-formed (which was provided by default in any parser) and never had any real problems.


What takes fewer lines of code to parse?

    <element.name id.value="3.14">
Or accepting both:

    <element.name id.value="3.14">
    <element.name id.value=3.14>
How would you specify an empty value for mandatory attributes?


I've seen empty values written as

    <tag attr1 attr2="val">data</tag>
Whether that's legal or not, I don't know.


Not valid. Wondering if you've seen that within HTML, where it is valid.


Actually, now that you mention it, I think it's from Chrome's Inspect Element tool, but I can't check right now.

I think if you wrote something like

    <div class="">...</div>
it would display in the tool as

    <div class>...</div>


Chrome's Inspect Element shows you the non-serialized DOM structure, which means it's neither XML nor HTML at that point.


Oh, this is the difference between attribute-oriented XML, element-oriented XML, and whatever-the-hell-we-feel-like-oriented XML. Publishers should pick one of the first two and be consistent about it.


Agree. Practical/pragmatic use of XML as a data format requires consistency.


> I have a system that has things like <Task ID="6">Blah</Task>. Why is the ID, clearly always an integer in every sample of hundreds I see, represented as a string?

You're really asking a different question here: "Why should an integer be used as a task ID?" Storing the task ID as a string may give you options in the future that you wouldn't otherwise have, at a relatively small cost in parsing performance and validation overhead.

Most of the world's regrettable XML schemas were faulty at the specification stage, not the implementation stage. To minimize the likelihood of eventual regret, I usually prefer to store stuff in strings unless there's a very good reason not to. The fact that I'm using XML means that I'm not that concerned about performance, so... strings, it is.

A similar argument can be applied to the child/attribute dilemma. If there's even the slightest chance that a field isn't always going to be a leaf node, I'll do the extra typing and make it a child. Ideally the parser would be written to make them both work the same anyway.


I see you were downvoted, but I happen to see merit in your comment. Again, a lot of people make technical decisions without stepping back, scanning their choices as a non-specialist (in the context of their programming domain), and asking: hey, does this make sense?


Technically all attributes are supposed to be surrounded by quotes regardless of how they're interpreted. That renders the premise of my whole comment invalid, to be "technically correct," so the people downvoting may have had that in mind.

Still, there are plenty of XML applications that leave out the quotes on numeric attributes. My point was really that they're not doing themselves any favors by abusing the spec that way. A text-based markup language is a great example of how premature optimization is unhelpful most of the time.


> JSON has become popular because a lot of bad programmers saw nothing wrong with calling eval on untrusted input (before JSON.parse was available).

Disagree. JSON became popular because it was extremely easy to implement (both for marshaling and consuming), and because it was extremely lightweight.

I think you could also make the argument that JSON was conceptually easier for programmers to wrap their minds around. You could just pretty-print it and quickly get an idea for the object's format, attributes, etc.


I agree, especially with the easy to understand part.

Look how short the standard is: http://www.ecma-international.org/publications/files/ECMA-ST... It's small and perfect, like a 2x1 LEGO block.

Here's the XML spec: https://www.w3.org/TR/REC-xml/ <backs away slowly>


XML could be fairly lightweight also. It was all the enterprisey-standard formats that were hideous.

E.g.

    {"name":"John","age":42}
vs.

    <person name="John" age="42" />


Now do the nested objects in both. One line does not show much.


    <person id="123" name="John" age="42" sec:checksum="...">
      <family-member type="spouse" ref="456""/>
      <family-member type="child" ref="789" />
      <fin:credit-rating score="A"
          last-change="2016-02-04T12:34:56Z" />
      <уфмс:статус значение="42" />
    </person>
Here we can describe `person/@id` as element ID and `family-member/@ref` as a reference to an ID so our XML tools can link these together.

Also note three more items from different namespaces: `@sec:checksum` could be some kind of technical information about the record, and `fin:credit-rating` is added by the financial module. Its `@last-change` is defined as a datetime, so as we read it with other XML tools we'll get it as a datetime type.

The next one is a tag in Russian that describes something related to Russia; XML can use all of Unicode in tag and attribute names.

Also, XML names are globally unique by design, so there's no clash between all the different pieces, and the tools can easily be configured to ignore parts they don't understand or to work as glue between different areas.

We can still efficiently validate the syntax of the whole piece or parts of it as we see fit.
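For example, that ID/IDREF linking could be declared in a DTD roughly like this (a sketch based on the names above):

    <!ATTLIST person        id  ID    #REQUIRED>
    <!ATTLIST family-member ref IDREF #REQUIRED>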


> Disagree. JSON became popular because it was extremely easy to implement (both for marshaling and consuming), and because it was extremely lightweight.

A canonical S-expression parser is strictly easier to implement, given that S-expressions consist only of lists and byte sequences (no numbers or objects), and is even more lightweight. JSON's big advantage was that it was familiar to a JavaScript programmer, that's all.


S-expressions are basically no syntax. Human-readability depends solely on the person that comes up with the schema. I mean, there are many reasons to love S-expressions, but human-readability is an unusual one. edn [0] is an interesting compromise (as is Clojure).

XML is actually IMO not that bad at human readability, it's pretty good. It's terrible at human writability. Conversely S-exps are lovely to work with.

[0] https://github.com/edn-format/edn


XML's bad rep for verbosity is almost entirely due to the nonsensical, terrible idea of requiring names in the end tag. Without that, it's about the same level of verbosity as JSON. And personally, after writing plenty of both by hand, XML is easier to get right. JSON, with its poor quoting rules (mandatory quotes on names??) and lack of comments, is very annoying to do by hand and seems visually more noisy.


An advantage of names in end tags is human readability. Consider this XML fragment:

  <a>12<b>34<c>56<d>78<e>90</e></d></c></b></a>
Appending something to the end of the d element is easy, since one can just search for its end tag. In JSON and other formats that only have one single character at the end, one has to count brackets or parentheses for this purpose:

  (12(34(56(78(90)))))


If they're all <a> then you're back to square one.

JSON solves this with indentation, pretty printing, and using paired symbols that most competent editors can automatically balance. This solves the homogeneous case too.

Incidentally, XML can benefit from the first two, and many editors balance tags, so you can get the same thing there.


It is rare in real-world XML that elements have children with the same type. Do you have a (non-divitis) example where the tags are all the same?


It happens with any tree structure. E.g. I used to work on a system that managed reinsurance contracts and represented them as trees of contracts.


Did the elements often have immediate child elements that had immediate child elements (and so on) of the same type? Like:

  <contract><contract><contract><contract> […]


No, there were a couple of layers in that case. But that doesn't actually help you add a child at the correct level, because the end of a contract would look something like:

                ...
                </contract>
              </subcontracts>
            </content>          
          </contract>
        </subcontracts>
      </content>
    </contract>


> I think that XML's bad reputation comes from the fact that it is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb id="123">incredibly</adverb> <adjective>verbose</adjective>.

> Also, the whole child/attribute dichotomy is a huge, huge mistake.

Those two factors run counter to each other. Attributes decrease verbosity, compared to child elements.

I agree, though. A few changes would make XML closer to ideal: eliminate attributes and eliminate the name in closing tags (<tagname>value</>), which makes child elements much less verbose, and reduces the need for attributes.


> A few changes would make XML closer to ideal: eliminate attributes and eliminate the name in closing tags (<tagname>value</>), which makes child elements much less verbose, and reduces the need for attributes.

Then just change '<tagname>' to '(tagname,' and '</>' to ')' and you'll have S-expressions.

Consider this:

    (feed
     (version 1)
     (title "Example Feed")
     (link http://example.org/)
     (updated "2003-12-13T18:30:02Z")
     (author (name "John Doe"))
     (id urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6)
    
     (entry
      (title "Atom-Powered Robots Run Amok")
      (link http://example.org/2003/12/13/atom03)
      (id urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a)
      (updated "2003-12-13T18:30:02Z")
      (summary "Some text.")))
That is a canonical S-expression (for a Scheme or Common Lisp reader, just quote the URIs too) version of:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
    
    <title>Example Feed</title>
    <link href="http://example.org/"/>
    <updated>2003-12-13T18:30:02Z</updated>
    <author>
    <name>John Doe</name>
    </author>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
    
    <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
    </entry>
    
    </feed>
I particularly like how URIs are sometimes encoded as attributes and sometimes as child text elements.

And compare to your proposed version:

    <feed>
    
    <title>Example Feed</>
    <link>http://example.org/</>
    <updated>2003-12-13T18:30:02Z</>
    <author>
    <name>John Doe</>
    </>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</>
    
    <entry>
    <title>Atom-Powered Robots Run Amok</>
    <link>http://example.org/2003/12/13/atom03</>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</>
    <updated>2003-12-13T18:30:02Z</>
    <summary>Some text.</>
    </>
    
    </>
I think it's pretty clear which is the most readable and elegant.


If you're going to compare the two fairly, include appropriate indentation for both, not just the S-expression version. Also put the author and name tags on the same line, as you did with the S-expressions:

    <feed>
      <title>Example Feed</>
      <link>http://example.org/</>
      <updated>2003-12-13T18:30:02Z</>
      <author><name>John Doe</></>
      <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</>
    
      <entry>
        <title>Atom-Powered Robots Run Amok</>
        <link>http://example.org/2003/12/13/atom03</>
        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</>
        <updated>2003-12-13T18:30:02Z</>
        <summary>Some text.</>
      </>
    </>
That said, I like S-expressions too, and I wish more parsers and tools existed for them, such as schemas, query tools, and simple transformation tools.


> If you're going to compare the two fairly, include appropriate indentation for both, not just the S-expression version.

When I pasted it in from https://validator.w3.org/feed/docs/atom.html#sampleFeed I guess I lost the indents. No idea why: they are clearly there in the original.


> I particularly like how URIs are sometimes encoded as attributes and sometimes as child text elements.

I think the distinction here is that the one is an identifier which is not intended to be dereferenceable, and the other is a link to a resource which has to be retrievable. In the good old days the id would most likely have been a URN and the link a URL, but that distinction was being discouraged in favour of the more general URI term at the time the Atom spec was developed. [1]

So while they're syntactically both URIs (well technically IRIs), they're functionally quite different. It may be debatable whether that's a good enough reason for the one to be an element value and the other an attribute value, but I don't think that decision was obviously wrong.

[1] https://tools.ietf.org/html/rfc3986#section-1.1.3


The second and third examples do not have namespaces.

How would you include an HTML summary, for example?


> The second and third examples do not have namespaces.

> How would you include an HTML summary, for example?

As a text attribute, honestly — which would be necessary in XML as well (you could embed XHTML in XML, but not HTML). And in the general case, embedding one variant of XML inside another, rather than embedding a character-encoded variant of XML inside another, doesn't seem all that useful. How often do transforms need to reach all the way in like that?

I guess it's cool if it's possible, which is why I like S-expressions all the way down. But I don't think it's all that useful, as opposed to neat.


> How often do transforms need to reach all the way in like that?

In my experience, almost every time XSLT is used on real-world documents, those are documents with multiple namespaces. XSLT stylesheets themselves are also documents that have multiple namespaces. Example: Atom feeds often contain XHTML content. It is a common problem with RSS that it does not specify if the content of an element is HTML or plain text.

I have found that arguments doubting a feature is necessary from people who cannot imagine use cases are almost invariably wrong, while arguments doubting a feature is necessary from people who list use cases and explain why they think those are better solved otherwise, or even left unsolved, are often right. Your post seems like an example of the former; would you say that complex real-world content with namespaces could sway you in favor of them?


I would be convinced if I saw real-world examples where having namespaces gave an advantage over not having namespaces. I can see the value in specifying whether the content of a given node is XHTML or text. I can at least theoretically see value in allowing nesting XHTML without a layer of escaping. I can't see any non-theoretical way in which namespaces are necessary to accomplish these things.


Example: The XSLT stylesheet for this Atom feed generates a web page for each entry: http://news.dieweltistgarnichtso.net/notes/index.xml In this setup, the Atom XML for each entry is generated from XHTML with XSLT, which makes it possible to automatically include an Atom enclosure element for every XHTML media element. To publish a podcast episode, it is enough to add a post with an <audio> or <video> element, as an XSLT stylesheet can “reach into” the XHTML content.

Namespaces are also widely used in SVG, which uses the XLink specification for hyperlinks and can embed XHTML and MathML content. Since SVG can be embedded in (X)HTML, this means you can have an Atom feed containing XHTML containing MathML and SVG that contains XHTML, and have it all displayed correctly.


> Example: The XSLT stylesheet for this Atom feed generates a web page for each entry: http://news.dieweltistgarnichtso.net/notes/index.xml

> In this setup, the Atom XML for each entry is generated from XHTML with XSLT, which makes it possible to automatically include an Atom enclosure element for every XHTML media element. To publish a podcast episode, it is enough to add a post with an <audio> or <video> element.

Sure. Why do you need namespaces to do that? Why couldn't you do it in XML-without-namespaces (or even JSON and some theoretical JSON-transformation-lanugage?)

> Namespaces are also widely used in SVG, which uses the XLink specification for hyperlinks and can embed XHTML and MathML content.

Again, why are namespaces necessary though? Why not just have a tag whose content is specified to be XHTML/MathML ? Wouldn't you want that anyway for the sake of human readability?


XML without namespaces does not exist. If it existed, how would you differentiate between title and link elements in Atom and title and link elements in XHTML? They have the same element names, but do not have the same meaning and therefore must be processed differently. Namespaces ensure that any XML processor can know the language of each part of the input.

Namespaces actually are the general mechanism with which you can specify that content is in another language: If you look at the feed source code, you can see that XHTML content is started with <div xmlns="http://www.w3.org/1999/xhtml"> and ends where that div element is closed.

Having an element with the semantics that “this content is in another language” is done out of necessity in HTML, as it has no namespacing: <style> elements contain CSS, <script> elements contain JavaScript, <svg> elements contain SVG … having an element in each language to embed each other language would become complicated very fast.


> XML without namespaces does not exist. If it existed, how would you differentiate between title and link elements in Atom and title and link elements in XHTML?

By where it is in the structure. The document is a tree where each element has well-defined context; there should never be confusion about whether a particular <title> is part of the feed or part of the content in the feed, because if it's in content it will be inside the content tag.

(Don't you need to do that anyway? I mean what if the XHTML had another Atom feed embedded in it? Or the content of one of the entries in the feed was another Atom feed? That's legitimate, but you wouldn't want to show titles from the "inner" feed as titles in the feed).

> Having an element with the semantics that “this content is in another language” is done out of necessity in HTML, as it has no namespacing: <style> elements contain CSS, <script> elements contain JavaScript, <svg> elements contain SVG … having an element in each language to embed each other language would become complicated very fast.

Only if you need the ability to embed an arbitrary other language. And if you do need that you can't possibly be validating or transforming based on what's embedded, so what value is the namespacing of it giving you?


You may have incomplete documents (e.g. documents with conditional sections, very much like XSLT):

    <code:if test="...">
      <!-- whatever -->
    <code:else>
      <!-- whatever -->
    </code:if>
Here you'll first process your code part and copy the contents as they are, and then process the contents; but in the source document the two languages are interspersed.

Or you may want to extend your text format with, say, literate programming and add code fragments and files. In my homegrown system it's like that:

    <literate:fragment id="..." language="...">
      <text:caption>...</text:caption>
      <literate:code>...</literate:code>
    </literate:fragment>
My text system already has a notion of captions so there's no need to add my own "literate:caption" here. Yet the other two "literate" elements are new and unique. Also, using a namespace here ensures that I'm sure not to have a clash if the base system adds their own "fragment" or "code" blocks.


OK, I guess that takes things a level up. I don't like that kind of interspersed style and I don't think incomplete documents should be the same kind of thing as complete ones (e.g. one can't meaningfully validate your first example, because what if the "whatever" is an element that has to be present exactly once). But I can see that if you want to write things this way then namespaces help.


“I don't like” seems to be an æsthetic argument, not a technical one.


> The document is a tree where each element has well-defined context; there should never be confusion about whether a particular <title> is part of the feed or part of the content in the feed, because if it's in content it will be inside the content tag.

In this specific case, maybe – but generally, it is not true that you can infer the namespace of an element from context. Also, elements can have multiple attributes with different namespaces (and often do).

> I mean what if the XHTML had another Atom feed embedded in it? Or the content of one of the entries in the feed was another Atom feed? That's legitimate, but you wouldn't want to show titles from the "inner" feed as titles in the feed

That actually appears to be a bug in my stylesheet. Thank you for bringing it to my attention!

Programs often use namespaces to provide metadata. Here is an SVG I created with Inkscape that uses six different namespaces for metadata: http://daten.dieweltistgarnichtso.net/pics/icons/minetest/mi... Thanks to namespacing, web browsers can display the picture while ignoring Inkscape-specific data.

> Only if you need the ability to embed an arbitrary other language. And if you do need that you can't possibly be validating or transforming based on what's embedded, so what value is the namespacing of it giving you?

It is very useful to embed any arbitrary language, as XML processors can preserve the content they do not understand without processing it. My XSLT stylesheet would have no issue with SVG embedded in XHTML, just as your web browser most likely ignores everything about the SVG linked above it can not understand.


> It is very useful to embed any arbitrary language, as XML processors can preserve the content they do not understand without processing it. My XSLT stylesheet would have no issue with SVG embedded in XHTML, just as your web browser most likely ignores everything about the SVG linked above it can not understand.

Sure, but you can ignore extra attributes in JSON or hypothetical XML-without-namespacing too. I feel like there's an excluded middle here: either the content of a given tag has to be, say, SVG, in which case the validation schema for the outer document could just say (in a structured way) "the content of this tag must be a valid SVG document according to the SVG schema", or the content is some opaque arbitrary XML document, in which case there's no meaningful validation to be done.

Even when working with something like XHTML-with-embedded-SVG, I found myself wishing there was a way to strip the namespaces, run my xpath queries / xslt transformations on the stripped version, and then put the namespaces back; I think I'd've got my actual business tasks done a lot quicker that way.


Ignoring other attributes in data formats without namespaces is not as easy. What if one language is embedded in another and each one has a title element?

I do not know why you “feel” that way about the middle you want to exclude. It has been proven to be very useful in practice for me. Also without it, XML would not have the “extensible” property.

The way you describe working with “XHTML-with-embedded-SVG” reads to me like there is something about namespaces or your toolchain that you have difficulties with. I found that with XML-based systems, especially XSLT, it is easy to make a task needlessly complicated if one does not understand the details.


The creators of XML were aware that it was verbose; they mention in their design goals that this was the least priority.

Child and attribute "dichotomy" is not a mistake. What you mean is that these two samples appear to be equivalent:

    <foo value="123" />
    <foo>123</foo>
But they are not equivalent. The first line (with an attribute) is there solely for the computer. When the document is rendered, the human user is not supposed to see anything there unless the computer adds it.

The second line (with text content) is there for both the computer and the human user. The text "123" is for the human user; the fact that this text is something called "foo" is for the computer. When the document is rendered, the human user will see "123" here. Maybe computer will enhance something or maybe it will just use it as index or reference, whatever.

Most people who don't like XML seem to only encounter it in config files. In config files there's normally no content that needs to be there for the end users, so all data can happily go into attributes. The text content starts to matter when we deal with natural language texts.


> The creators of XML were aware that it was verbose; they mention in their design goals that this was the least priority.

Which seems pretty wasteful.

> Child and attribute "dichotomy" is not a mistake.

It's not for a markup format — as I mentioned, it can make sense there — but, as you mentioned, it doesn't make sense in a config or data file format.


The problem is that XML maps badly to data structures in common programming languages. JSON maps perfectly to structs and datastructures as lists/arrays/maps.

S-expressions are good if you work with Lisp like languages, but I don't think they're very readable if you're not into Lisp. I also can't see how they map easily into datastructures of imperative programming languages or even statically typed functional programming languages like haskell.


> S-expressions are good if you work with Lisp like languages, but I don't think they're very readable if you're not into Lisp.

Take a look at https://news.ycombinator.com/item?id=12198581; I think it demonstrates how readable one dialect of S-expressions can be.

> I also can't see how they map easily into datastructures of imperative programming languages

JSON consists of numbers, strings, booleans, objects and arrays; canonical S-expressions consist of bytes and lists. I contend that one can easily encode strings, numbers and booleans alike as bytes, and both objects and arrays as lists. Consider:

    {
        "id": 1234,
        "isEnabled": true,
        "props": ["abc", 123, false],
    }
This could be encoded in canonical S-expressions as:

    (object
     (id "1234")
     (is-enabled "true")
     (props (abc "123" "false")))
Granted, one still must convert the strings "1234," "true," "123," and "false" into the expected types, but with JSON one still must check the expected types anyway; it's not that big a difference.

And I honestly think that the S-expression version is far more attractive.


You could make it more like S-expressions in JS if you really wanted.

    {object: [
      {id: "1234"},
      {isEnabled: "true"},
      {props: ["abc", "123", "false"]}]}
Not quite the same, but nothing keeps you from parsing an array of key/value pairs instead of a hash.


You may not leave JSON object properties unquoted, so it'd have to read:

    {"object": [
      {"id": "1234"},
      {"isEnabled": "true"},
      {"props": ["abc", "123", "false"]}]}
So you have extraneous quotes, extraneous colons, extraneous commas, plus the parsing code is complicated by having to handle all of that rather than atoms & lists (that's not a strong reason, since parsing code is written once and used millions of times).

I really, really don't get the visceral opposition to S-expressions. From my perspective they're both better & simpler.


There is a very big difference - "with JSON one still must check the expected types anyway" is not really true, I can deserialize an arbitrary json and I will know the difference between 123 and "123" even if I don't know what's expected or, alternatively, mixed-type values are expected.


> There is a very big difference - "with JSON one still must check the expected types anyway" is not really true, I can deserialize an arbitrary json and I will know the difference between 123 and "123" even if I don't know what's expected or, alternatively, mixed-type values are expected.

You will still need, in your code, to handle both 123 & "123" (or handle one, and error on the other). That's really no different from, in your code, parsing "123" as an integer, or throwing an error.

In JSON one must check that every value is the type one expects, or throw an error. With canonical S-expressions, one must parse that every value is the type one expects, or throw an error. There's really no difference.

If one is willing to use a Scheme or Common Lisp reader, of course, then numbers &c. are natively supported, at the expense of more quoting of strings (unless one chooses to use symbols …).


> You will still need, in your code, to handle both 123 & "123" (or handle one, and error on the other). That's really no different from, in your code, parsing "123" as an integer, or throwing an error.

It is different because in the latter case you have to write your own code to do it, while in the former your library will handle it for you.

> If one is willing to use a Scheme or Common Lisp reader, of course, then numbers &c. are natively supported, at the expense of more quoting of strings (unless one chooses to use symbols …).

So this format comes in dozens of partially-incompatible variants? Lovely.


> "The best human-readable data transfer format is probably canonical S-expressions"

I personally think TOML is a bit more readable...

https://github.com/toml-lang/toml


For configuration files, not for data serialisation.


Let's put it like this... what can you express in JSON that you couldn't express in TOML?


I can cleanly parse JSON, serialize it, and be confident I haven't lost anything. That can't be done for a language that allows comments without complicating the AST.


YAML is more readable than TOML though.


XML is very often the least bad format (compared with ASN.1, JSON, X12 EDI, CSV, and other interchange formats), particularly when dealing with statically typed languages. XML is a horrid chimera of SGML, but at least it is human readable, subject to machine validation, and gets the job done.


Oh EDI... you made me shudder.


No experience with ASN.1?


Nope. I ran into EDI when I was doing an integration for a JIT Hub that had to integrate with Hitachi and Seagate inventory systems. It was pretty awful to work with but the protocol was rock solid.


Well, XML is complicated, so it's hard to build support for, and it's verbose, so it's heavy on the wire. Frankly, I think JSON is a better format in most contexts.


The biggest problem with XML is that it's a node labeled tree that makes the schema choice between leaf node and attribute for scalar data almost arbitrary, whereas JSON is an edge labeled tree without the same choice. Most programming languages use edge labeled graphs for in memory data structures, so the semantic distance is lower with JSON.


Indeed. Furthermore, JSON readily differentiates between a single element {a: 'hello'} and a vector with one element {b: ['hello']}, as do most programming languages. XML does not, which leads to weird constructs like <Names><Name>a</Name></Names> to indicate that more than one name is possible. (Except .. if you actually use a schema with your XML parser, that indicates more than one is possible. But almost no one does). JSON also differentiates numbers from strings, etc.

As a result, in my experience, JSON tends to be more robust in real world use - even when a schema is available.


Can you explain what you mean by JSON being an edge labeled tree in more detail? I don't understand and would really like to.


Taking a stab at this...

Let's say we have a dog who has four paws. In XML:

    <dog>
      <paw health="ok">
      <paw health="ok">
      <paw health="ok">
      <paw health="ok">
    </dog>
In JSON:

    { "paws": [
        { "health": "ok" },
        { "health": "ok" },
        { "health": "ok" },
        { "health": "ok" },
    ] }
I think what the GP is getting at is that JSON is always describing the relationships between a thing and another thing, rarely the things themselves. In the JSON version, for example, it can be assumed that an object in the "paws" array is a paw.

This example is sort of a straw man. The JSON version could be wrapped with { "dog": {...} } and the individual XML paws could be wrapped in a <paws> element. But in any case, JSON doesn't need you to explicitly label the type of each paw, just what each belongs to and what's known about it.


I'll step up to the plate to give a more technical answer.

JSON is an edge labelled tree, XML is a node labelled tree. Let's see what that means, but first, let's talk about what nodes, edges, trees, and labels are. You may already know, but I don't want to make any assumptions.

First, a tree: A tree is a data structure with nodes, which reference other nodes, and each node is referred to by at most one other node. Now, XML is obviously a tree, with each tag being a node:

       <dog>
       /| |\ 
      / | | \
     /  | |  \
  <paw> | | <paw>
        / \
       /   \
      /     \
    <paw>  <paw>
JSON is also a tree; however, instead of tags, we have arrays and objects:

Well, actually, I'm not going to draw that. I'm typing on a phone, and it was hard enough making that last one. So, you know, just imagine it. And if you imagine hard enough, you just might notice that this graph is edge labelled, rather than node labelled.

A node, as you may recall, is just a thing on the tree, like a tag, or an object or an array. Cue the music!

  TO THE TUNE OF "NOUNS" FROM SCHOOLHOUSE ROCK:
  Oh any list through which you can go (like an array, a linked list, or an arraylist),
  And any structure that you can show (like a hashmap, or a struct),
  If they have pointers you can follow (from an object in a tree),
  You know they're nodes, you know they're nodes
Aaaanyway, an edge is the link between two nodes, and labels are just names.

JSON labels edges: "I want the first value in the array you got at key "foo" from the root object."

XML labels nodes: "I want the paragraph tag with the id of 'foo' inside the body tag inside the html tag."

You see, with the JSON, the nodes themselves didn't have labels, just the links between them: With XML, it was the opposite: There was no name for the links, instead there were names for the objects.

GGP reminds us that most programming languages do it the same way JSON does (when was the last time you referred to the Foo object in the Bar object in the head Baz object when coding?), and so JSON maps better to the kind of data structures we use most of the time.


I don't quite see why you can't do the same with xml; maybe it needs some more typing, but it is expressing the same thing.

  <paws>
    <health status="ok"/>
    <health status="ok"/>
    <health status="ok"/>
    <health status="ok"/>
  </paws>
I thought that the main advantage of JSON was that it can be used as is (code as data) in JavaScript, but the problem here of course is that without a parser/validator one can inject tons of malicious code. If you are not on JavaScript then you can't do without a parser / in-memory tree structure - and that's the same DOM model once again.

JSON needs a bit less typing; now is that really such a significant difference? I think that adoption in matters of markup is more like fashion - once people get the hang of it, it seems natural and goes without explanation.

I would say that there is one major difference - binary or text. As long as it's text, it doesn't quite matter how you structure your markup; if you need your data to be small, you will have to compress it. However, parsing a text tree will usually take more time than deserializing a binary structure (by several factors).

Therefore you will use text markup where application speed is not very important, or where speed of development is more important than application performance, or you will use it for complicated configuration data (and your users will hate you, because a name-value format like ini files is easier to handle - well, mostly).


> I don't quite see why you can't do the same with xml; maybe it needs some more typing, but it is expressing the same thing.

It's not idiomatic though - the dog's paws aren't "healths". The point is that in XML each tag is expected to have a label and be an entity in its own right, whereas in JSON you expect each field to be an attribute.


Graph theory terminology: nodes are connected by edges. When drawn, the edges are the lines, the nodes are the blobs that are connected by lines.

All trees are graphs. Not all graphs are trees; there could be cycles, or children with multiple parents in an arbitrary graph.

In JSON, the nodes are literals: numbers, strings, booleans, arrays, hashes (object constructors). The edges are hash keys (object field names in the constructors).

In XML, the nodes themselves have the names. The edges are implicit in the syntax via containment, and are unlabeled.

In programming languages, generally our values don't have names. Instead, our variables have names, and refer to values; variables can be assigned different values, but the name doesn't change. More physically, if the values are stored on the heap, variables are pointers to values on the heap, and fields of heap objects are further pointers to more values on the heap. Here, variables and fields are edges, and the values are the nodes. Looked at from a graph theory perspective, the in-memory model is an edge labeled graph.
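
A concrete way to see the difference, sketched in Python with nothing but the standard library (the dog/paw data is just the running example from above):

    import json
    import xml.etree.ElementTree as ET

    # JSON: the names live on the edges (object keys); the values themselves are anonymous.
    dog = json.loads('{"paws": [{"health": "ok"}, {"health": "ok"}]}')
    print(dog["paws"][0]["health"])          # navigate by edge labels

    # XML: the names live on the nodes (tag names); containment is the unlabeled edge.
    root = ET.fromstring('<dog><paw health="ok"/><paw health="ok"/></dog>')
    print(root.find("paw").get("health"))    # navigate by node names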


Indeed. Even translating to Lisp, which has closer datastructures than most languages, the XML is translated to an edge-labeled tree.


Great explanation, I'd never heard it put that way before, thanks!


What do you think JAVA stands for? It's not an abbreviation. It's the name of an island and it's just 'Java'.


It's more accurate to say that it's named for the coffee beans that come from said island.


It's even more accurate to say that it's named for the coffee made from the coffee beans that come from said island. ;-)


Ruby has had an included XML library since before Rails was released. soap4r is older than Rails too. I wrote my share of clients for SOAP services back then. soap4r wasn't fun to use but it mostly worked. If the service was really simple (a single call and response, for instance) it was sometimes more expedient to put together the request yourself.

When Savon came out 6-7 years ago it was a huge relief. Luckily, by that point, I was seeing a lot less SOAP. But even with Savon, the experience was only lifted to "not awful", never to "wow, I'm glad they used SOAP, this is so easy."


My experience with early Ruby XML parsers is that they were all "how hard can this be?" hacks done over a weekend by people who didn't really use XML or understand the ecosystem of specifications, so they barely worked and often didn't support fundamental things like namespaces correctly. They took away everything which made XML powerful and left you with something that was often even more finicky.


Yeah, I worked on a Rails 1.x app which integrated with an xml feed. IIRC, the options were a library that used regexes internally and had horrible performance (but might have been fine for rss feeds), and another library which wrapped a C library and used callbacks[1]. Definitely was a huge pain point and probably was a mistake for me to use the latest hep environment for that app, but for the rest of Rails it might have been worth it.

[1] I think it might have been http://www.yoshidam.net/Ruby.html#xmlparser


>well regarded in the enterprise

I think this alone should be enough to cast doubt on it, based on my (albeit limited) interactions with "enterprise" software.

>I think it's bad reputation comes from anyone not using an enterprise language because the support just isn't there.

What, like JavaScript? I've had to read and write XML packets from a Node app to work with (surprise!) an enterprise app. I had probably 20 choices of libraries with varying levels of features, and the one I chose worked fine.

I was lucky, compared to some of the others on this page: The "RPC"-style XML commands and responses I had to parse and generate were all well standardized, so I just wrote a wrapper that extracted the completely opaque tree of XML into a flatter JavaScript object/hash that was really easy to deal with, and similarly made a wrapper that would trivially generate the monstrous XML required to send commands and responses back to the server. My JSON-equivalent objects were easier to manipulate (and would also have been easier to deal with in Java or, in this case, C#), equally rich in the information they carried, but could have been serialized with 1/3 the number of bytes per message. Totally a win-win-win.

What I don't understand is why anyone thought using XML that way was a good idea, and why it still is popular in the enterprise. Bad habits are hard to break, I guess.


> What I don't understand is why anyone thought using XML that way was a good idea, and why it still is popular in the enterprise. Bad habits are hard to break, I guess.

Namespaces, which then gives you easy answers for Internationalisation (xml:lang), a subject-predicate-object data structure (RDF), which can lead on to logical meaning/modelling of data (RDFS/OWL), which then lets you look at harder questions like trust/provenance.

There's also schema validation (XSD), transformation (XSLT), which then provides you tools like XPath.

Most of that is on the front page for the technology: https://www.w3.org/standards/xml/

The real problem is not syntax, it's communication between groups with differing experiences and interests - how do I know your messages mean the same thing as what my system expects?

If you prove to be malicious, do I have to write a strict validator before I trust your input?

If you want to ensure your messages are well formed before they are sent, do you also have to write a validator?

How do I know our validators are checking the same things?

If you want to send a large document oriented data structure, but I only care about a specific section relating to my interests; do I have to understand where to look and what all of the surrounding material is; or can I query for the relevant bits?

On the more complicated RDF side of things - if you want to share identifiers with me, how do we both avoid calling everything record id=1?

If we are both talking about the same thing but know different parts of the story, how can I recognize your information as describing the same thing I know about?

If we both know about the same Thing, and know certain logical facts about that Thing, can we check those facts actually make sense against shared rules?

If we both know about the same Thing, and can see a logical inconsistency in data, can we reason about which data to Trust and why?

Unfortunately, communicating properly is hard even with all of the tools to help.

We tend to opt towards subjecting systems to an ongoing fuzzing test because we don't value many of the above things - we tend to work in organisations with a short attention span focused on the now and a narrow set of interests. It just kind of works for the 80% of the time, so we move on.

Contrast that with something like a library or museum, and you see why ideas like Dublin Core really catch on there.


Sounds great in theory. In practice it doesn't seem nearly as carefully implemented, and/or XML is used where it's actually not needed.

XML is designed to be a markup language. The fact that it has all of these other things bolted on doesn't actually make it a good generic data interchange format.

For things like RDF, maybe it's the best option we have, but that's not because XML is great, it's because XML was used in the only standardized option.

Looking at an example of xml:lang:

    <?xml version="1.0" encoding="utf-8" ?>
    <doc xml:lang="en">
     <list title="Titre en français" xml:lang="fr">
      <p>Texte en français.</p>
      <p xml:lang="fr-ca">Texte en québécquois.</p>
      <p xml:lang="en">Second text in English.</p>
     </list>
     <p>Text in English.</p>
    </doc>
...this is a nightmare. If I want to translate a document, the last thing I want to do is embed each translation inline like that. Almost certainly the best response is to "fork" the document at the highest level and include separate language versions of it; otherwise, if you have 20 translations, the document carries 20x the text that any one reader will need.

Yes, XML gives you that particular hammer. But using XML results in a lot of sore thumbs.

Schema validation is nice to be sure. I'm using JSON Schema validation myself [1] to verify incoming JSON, and I'm automatically generating those schemas from the TypeScript data structure specifications [2]. This is particularly good for a JavaScript language target, of course, but I find XML and XPath to be ugly or painfully slow in every language I've used them from, while JSON just has a better impedance match to data storage and interchange.

[1] http://json-schema.org/

[2] https://github.com/YousefED/typescript-json-schema
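
The same idea works outside the TypeScript toolchain mentioned above; a minimal sketch with Python's third-party jsonschema package, where the schema below is invented purely for illustration:

    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "isEnabled": {"type": "boolean"},
        },
        "required": ["id"],
    }

    try:
        # Reject incoming JSON that doesn't match the declared shape.
        validate(instance={"id": 1234, "isEnabled": True}, schema=schema)
        print("accepted")
    except ValidationError as err:
        print("rejecting bad input:", err.message)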


> What, like JavaScript?

No, JavaScript is not what I'd consider an enterprise language. I'm talking about C++, C#, VB.NET, Java, and LotusScript. Enterprise languages and enterprise applications (e.g. Siebel) have no problem talking to each other via SOAP/XML and they all produce WSDLs that are easily consumed by one another.

When using something like C# or Java you can easily import a WSDL from another application, and the toolchain will automatically generate all of the objects defined and properly serialize/deserialize XML into those objects. There's no need to write parsers or use sockets/webclients to talk SOAP.

Newer backend languages and frameworks (e.g. RoR, NodeJS, etc) don't have these mature and robust toolchains for XML/SOAP.


My impression is that the reason why XML is so well-regarded in the enterprise is because these companies are not aware of better alternatives, such as Protocol Buffers [1]. The reason why XML has a bad reputation outside of the enterprise is because it is so incredibly verbose (both the language itself and the code used for working with it), and that all-in-all, it is a sub-optimal solution to a solved problem.

To illustrate: Protocol Buffers' wire format is much more compact. It removes the complexity of having to deal with XML parsers by providing classes generated from the message definition/schema. You can use it with GRPC to implement your service APIs. It is supported for many different languages, including Java and C#. It now even has a JSON mapping [2]. Overall, Protocol Buffers can do everything XML can do as both an exchange format and as a configuration language but better.

[1] https://developers.google.com/protocol-buffers/

[2] https://developers.google.com/protocol-buffers/docs/proto3#j...


Protocol Buffers are just one of many proprietary serialization libraries. Regardless of technical excellence, Protocol Buffers and competing libraries are automatically much less suitable for actual enterprise use than open standard serialization protocols with multiple interoperable implementations, such as ASN.1. And of course, XML is usually preferable to ASN.1 or the like because it is equally standardized but it has an ample choice of implementations, advanced tools and human readability and writability.


Protocol Buffers is not proprietary. It is open source under the BSD license. Here is the source code: github.com/google/protobuf. It is very much an open protocol, and anyone is free to write their own implementation of it. It is just not a standards-based protocol.

If your organization values the existence of a standard over technical excellence, then there is no use in convincing you. Otherwise, in terms of ease of use, performance, tooling, and human readability and writability, Protocol Buffers is superior to XML-based protocols (since the API for converting between the binary and text formats is extremely simple to use).

As a fun fact, if you really wanted to use XML as a wire format, you could even write an XmlFormat ser/de for Protocol Buffers, similar to the JsonFormat that is already provided, but then it would defeat one of the main purposes of using Protocol Buffers in the first place because you would replace an extremely performant wire format with an extremely sub-optimal one.


Protocol buffer isn't proprietary. It's just not a standards based protocol. But it doesn't stop you from writing code against it, and you can easily interop with a third party who is using protocol buffers.


Another part of it is that statically typed languages benefit much more from XML with a strictly defined schema like DTD or XSD because it makes it easier to generate the objects that you're going to have to map it into.

With a language like Ruby, PHP, etc. that isn't statically typed, it's not nearly as big of a deal. Developers in those languages are used to assuming everything is a string and converting it to something useful without the need to premap every datatype.

That's probably the main reason that XML was so much more popular with the languages you mention compared to the parts of the ecosystem that didn't benefit from its constructs much (if at all).


Some time back I needed to generate an XML file in a Java web application. I attempted to figure out how to do it "right". The only "special" requirement was that it is formatted in a readable way.

So I was figuring out the Java XML stuff (don't remember what that was exactly, probably standard). But at some point the timeout in my brain kicked in, and I just wrote a loop generating the XML by brute-force through PrintWriter or something. I even escaped strings right since some library I had available conveniently offered the escape method (Guava maybe?).


Back in the early days of XML, Internet Explorer would insert "+" characters to fold nested sections of XML, and it was the default program for opening .xml files. Guess what showed up in the documents I got from an integration partner?


It still does! And I get corrupted files like that mailed to me weekly by integration partners. I may be wrong, but I think FF also adds some crap to XML files when used as a viewer. I actually like XML; for some reason its structure makes a lot of sense to me, while JSON is untidy and confusing.


Guess what caused a serious outage of a system at a customer that I know, with an estimated impact on his bottom line in the seven-digit area? Yeah, right: naive copying of some XML out of IE into the configuration of said system. Including those '+' characters, which resulted in it not exactly being XML anymore.


I once got an XML file from an integration partner where the whole thing was XML escaped (all the tags looked like &lt;node&gt;value&lt;/node&gt;) because they had embedded it within an outer "envelope" XML file. They saw nothing wrong with this and argued when I questioned it. I wonder how they were planning to express escape sequences within the inner XML document that was already escaped...


It's ugly of course, but a parser should have no problem with &amp;amp; or &amp;lt;. It can go arbitrarily deep.
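
A quick illustration of that with Python's standard library (the sample string is made up):

    from xml.sax.saxutils import escape, unescape

    inner = "<node>value & more</node>"
    wrapped = escape(escape(inner))   # roughly what the "envelope" did to the inner document
    print(wrapped)                    # &amp;lt;node&amp;gt;value &amp;amp; more&amp;lt;/node&amp;gt;
    print(unescape(unescape(wrapped)) == inner)   # True - it unwinds cleanly, however deep it goes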


I think if you selected/copied what was displayed you'd get the plus and minus signs. If you saved the file, that wouldn't happen.


Compared to the problems when dealing with 'delimited text', XML is great.

Also, it's flexible in that you can specify properties as attributes or as child nodes, depending on wildcard specifications.

So I have dealt with lots of edge-case XML situations, but the solutions are always straightforward. Also, it helps to have a client rather than trying to parse out raw XML, which means programming and scripting sometimes rely on personal tool development. XML handles scope creep well.


Handling scope creep is my favorite feature. With XML, it's easy to deserialize even if an expected element is not there, or if there is an extra one you're not expecting - at least that's been my experience. I haven't done much JSON, so I'm not sure how that would work with it.


Pretty much any "real" serialization format should handle that situation fine. Protobuf, JSON, YAML, Thrift, heck, even Java serialization can handle that, provided you set a serialVersionUID.


JSON deserialization would basically be the same. XML does not score here.


On the Cognicast there was an excellent tangent (all of them were good) in episode 106, where Michael Nygard laments, along with fellow Cognitect Craig, that despite all the hate from the JSON generation, the failed promise of XML was the ability (again, that is part of it, not the whole) to separate data and presentation with schemas, so you would not have to redesign endpoints all the damn time.

http://blog.cognitect.com/cognicast/106

This is just one view, and I am sure I will be mercilessly downvoted, as this is a gross simplification of that point, but it was one of many gems in that episode. I might finally revisit XSLT, as this once again affirms what other devs told me when they said not to write off XML: within its complexity is something interesting.


I loved XML and XSLT. And Internet Explorer, for all its faults, had great support for XSLT in the browser from version 5. It was quite easy to build "rich" single-page apps that get XML data from the server and build various user presentations by updating the DOM with XSLT.


The HTML for my blog is generated by applying an XSLT stylesheet to its feed.

You can see the stylesheet here: http://news.dieweltistgarnichtso.net/posts/atom2html.xsl

You can see the resulting web page here: http://news.dieweltistgarnichtso.net/posts/


I thought of that same exchange when I read this post but remembered it more as a lament that JSON doesn't support namespaces - so JSON is always context dependent.


This article in some ways describes the delta from HTML development to XML development. In the early/mid 2000s, XML was cargo-culted through the tech world on a massive scale; typically being adopted by web developers who proceeded to apply the same habits and tools for XML as they'd been using for HTML. Which of course resulted in many of the issues mentioned.


There's a popular piece of "newer" software that decided that XML rules were too difficult. So they URL encode all values. It also uses print style formatting for XML tag names, so if you manage to get a name value that has, say, a : in it, you'll get invalid tags. This is the default setup, in 2016, for a system that handles a lot of real-world telephone calls.

Even just a few years ago I worked with companies that wrote their own "XML parser". They explained it was pretty easy but they had to "special case" for broken output in the real world. An example of this output? "<tag />".

HTML would have been far better off if it had the strictness of XML. Remove end tag names so you can't have invalid nesting. If browsers had refused to parse invalid docs from the start, invalid docs would not have been produced. (And like XML, they could provide decent error messages, so the difficulty would not be significantly raised.)


I used to hate doing XML in Python - ElementTree was the nicest of them 10 years ago, but it still hurt.

But last year, I discovered xmltodict[0] and since then, I don't really care - it makes doing XML (both reading and writing) no more cumbersome than using dicts, while still supporting stuff like namespaces, CDATA and friends.

I still think XML is a horrible, misguided idea - from inception, but even more so in how it is used in practice - but I no longer feel any pain interfacing with it.

[0] https://github.com/martinblech/xmltodict
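
Roughly what that looks like in practice (the document here is a made-up example):

    import xmltodict

    doc = xmltodict.parse("""
    <dog>
      <paw health="ok"/>
      <paw health="hurt"/>
    </dog>
    """)

    print(doc["dog"]["paw"][0]["@health"])       # attributes show up as "@"-prefixed keys
    print(xmltodict.unparse(doc, pretty=True))   # and the dict goes back out as XML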


Python has a very good lxml module for advanced XML processing. You can define your own classes for XML elements, so you can read an XML file and get your own classes for the underlying elements. They're somewhat limited, you can easily define methods, but the data is locked to what's in XML. You can also define your own XPath functions and XSLT extensions. Comes very handy sometimes.

The API is still rather awkward though.
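
A minimal sketch of that element-class feature (the PawElement class and the tag names are mine, purely for illustration):

    from lxml import etree

    class PawElement(etree.ElementBase):
        def is_ok(self):
            return self.get("health") == "ok"

    lookup = etree.ElementNamespaceClassLookup()
    lookup.get_namespace(None)["paw"] = PawElement   # map un-namespaced <paw> tags to the class

    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)

    root = etree.fromstring('<dog><paw health="ok"/><paw health="hurt"/></dog>', parser)
    print([paw.is_ok() for paw in root])             # [True, False]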


I think a big problem with XML in most languages is the tooling around it. The libraries to parse/create it are not very pleasant to work with because of the immense complexity they have to deal with. If they only had to conform to a very small subset of all of XML's features and quirks, you'd have a very sane ecosystem.


There's really no reason to use UTF-16 other than compatibility with older software (which is usually broken when handling surrogate pairs). It's an atavism from the time when all Unicode codepoints fit into 16 bits.


I think that one basically boils down to this: back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people were basically software folks but thought that 16 bits were enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.


This reminds me of an interesting experience I had with XML at a previous job a few years ago.

We had bought a product from another company which was to be integrated into our own main product. Theirs was horribly ugly, looking like a cross between a 90's website and an infomercial, predominately in vivid shades of pink and purple. And it was really buggy. I soon noticed that all the content (many hundred pages with text, video and interactive content) was specified in a giant XML file and that the application itself simply interpreted this file and presented it to the user. We quickly decided that the best course of action was for me to reverse-engineer this XML file and write our own code to generate an integrated version of it, presented in a visual style more in line with the rest of our own product. This meant we could also solve some of their bugs on the way.

I still feel this was the only reasonable option and it did work out within our given time frame. However, I will never forget the horrors I saw in that one file. A few gems included:

- The file was most certainly handwritten, with lots of tag mismatches and spelling errors in tag names.

- One of the main sections was missing in their own standalone version because of a syntax error which caused their program to skip over the entire main branch of the syntax tree in which it occurred.

- Exercises where you had to order a list of items were defined as dragging items into hit boxes on a static bitmap image of the numbers 1-10 on a purple background. The same image was used regardless of how many items had to be ordered. The hit boxes didn't align with those numbers at all and often overlapped. In their implementation, items were stuck right where you dropped them, rather than snapping to a fixed position by the right number.

- We wrote a few tools to identify images and videos which were either present on disk but never referenced or vice versa. This was often a case of spelling errors, slight variations in word connotation or files placed in the wrong folder. In these cases, their original program would bail out and skip that page.

- Indices of chapters were written as plain text rather than inferred. They did not match how things were laid out in the XML and where it happened to align it was sooner or later broken by sections which were commented out or failed to parse.

There were many more issues, but these give some insight into the exciting challenge of getting their data to work in a consistent and logical manner. After the XML file had been thoroughly massaged into submission and uniformity, of course.


Please edit your post to eliminate the fixed-text:

- It will be easier to read.

- Reading won't require a lot of fiddly trackpadding.

- Maybe it would be nice if HN's simple markup system could handle the case in which the author wants a list of indented items, but it doesn't, and fixed-text is a poor substitute for that.

[EDIT:] Thanks!


There. I agree, it looked horrible.


This is by no means totally bulletproof, but these C macros around libxml2 let us write nested well-formed XML expressions as code:

Example usage: https://github.com/libguestfs/libguestfs/blob/master/src/lau...

Macro definitions: https://github.com/libguestfs/libguestfs/blob/master/src/lau...


Totally, we took this a step further and created a subversion repository where xml documents describe classes. Each method is either inline, or is described by a xml element of a particular namespace that links to a subversion id and revision. ;)


Note: I believe this is a reference to http://thedailywtf.com/articles/the-inner-json-effect


Some XML dialects become very confusing if features are added as an afterthought without consideration of syntax and semantics. Microsoft's Wordprocessing XML, for example, has caveats like w:permStart:

    <w:permStart w:id="0" w:edGrp="editor"/>
    (...)
    <w:permEnd w:id="0"/>
permStart and permEnd define regions where special permissions are required to edit a document. It is encoded in a complete anti-XML syntax, where different tags (and a common ID) represent the start and end of a region.


Microsoft Wordprocessing XML is very quirky :) I think they use these markers because different areas can overlap and thus you cannot express this with a tree-like structure.


There are many flavors of XML and JSON out there now. I think for many developers JSON started to "look good" when the number of standards stacking up around XML (and XML-ish/SGML-ish/HTML-ish formats) started to make people go insane. In the healthcare world we typically had to deal with a never-ending set of "format standards" that kept integrating themselves together. I guess originally that may have been the beauty of XML... we started with XML-RPC, moved on to SOAP 1.0, and SOAP 1.1 introduced new ways to send headers. At some point, however, it just went crazy - I think around when the enterprise-level people got their hands on things and started porting all of their non-standard wack-job features into XML.

WS-Addressing - OK, seems simple, but now your SOAP stack has to support async processing. WS-Trust - OK, let's add a simple feature that lets you put "some tokens" in the request and response for security, auditing, non-repudiation - good ideas, sure. WS-Eventing - let's add enterprise queuing to XML and SOAP and require stacks to support that, and let the users of the stack figure out a way to connect that to the queues.

Anyway the list goes on, and you can read about it here: https://en.wikipedia.org/wiki/List_of_web_service_specificat...

Suffice it to say, XML died because the developer now had to learn all of these and how they worked, because each tiny industry body would adopt 1% of each, requiring implementors to learn 99% of all of them. It basically just made JSON attractive - a reset, if you will.

XML won't go away. HTML will continue forever (it crosses a developer-designer "human line" that makes it kinda permanent). Developers adapt to future technologies a lot faster than designers and others dabbling in HTML.

Now, all this being said, you can see the list of standards piling up around JSON. There's really no critical-mass-ready replacement though, so JSON will be safe for quite a while longer. JSON will only be replaced in various "areas": YAML for config, binary JSON-compatible representations for the wire and/or storage.

I'm not biased against XML for data transfer, but if someone asked me to create a SOAP 1.1 service with WS-Trust, SAML tokens, etc., I'd argue for a more industry-accepted REST service with OAuth tokens instead, simply because it would be like introducing the Hummer all over again in an age where Teslas are everywhere - everyone would hate us.


XML is a perfectly fine format that was (ab)used dreadfully by many, many people to such an extent that many people only have examples of completely dreadful XML as their reference.

So many XML-as-interpreted-programming-language monstrosities out there. (I know - I wrote one. I had the perfect problem domain for LISP but didn't have the environment capability to use LISP; I did have a database XML field to store 'data' in, so I did XML-as-S-expression with a SAX-based interpreter - it was surprisingly nice.)


Discussions about XML and JSON often remind me of this comment on HN: https://news.ycombinator.com/item?id=5702868

Partial quote:

> XML can certainly be shorter than JSON and often is, and repeated tags are the best showcase for it:

     <user id="abc">
        <phoneNo type="home">123456789</phoneNo>
        <phoneNo type="work">321654987</phoneNo>
    </user>
> This turns into this beautiful JSON:

     {
       "users": [
         {
           "id": "abc",
           "phoneNos": [
             { "type": "home", "value": "123456789" },
             { "type": "work", "value": "321654987" }
           ]
         }
       ]
     }


Not a fair comparison since the JSON case includes the outer list as well. And whenever I've seen the equivalent of this in a real-world XML format it would use a <phoneNos> tag to group the phone numbers together.


You probably have not looked too closely at real-world XML.

• Many XHTML and SVG elements can occur without dedicated wrapper elements.

• In Atom feeds, <author>, <category>, <contributor>, <link> elements can occur multiple times without a dedicated wrapper element.

• In XSPF playlists, <link>, <meta>, <extension>, <location>, <identifier> elements can occur multiple times without a dedicated wrapper element.


Had to post this old article because I encountered some bozo code again. I was reading up on some CMS, planning to use it for my blogs, when I saw the code for the RSS feed. It was written by the lead developer of the CMS and used text templates.


The way your comment comes across is a bit irritating. Not understanding the underlying codebase and classifying it based on an attenuated knowledge of a topic promotes one to 'bozo' status more quickly than not. Many systems use text-template-based feeds; examples are Shopify, Salesforce, Wordpress, and more. Are these systems fundamentally broken purely because of this approach? Probably not. In your case, are the text templates escaping their values when outputting? Are they validating for correct XML once generated? Ask more questions rather than assuming pre-defined answers.


You mention typical PHP projects written by people who think they know better than the likes of Tim Bray.

PHP, the language that made short tags a configuration option because they wanted to mix program code with XML.

PHP, the language with a lot of different escape functions because they didn't get it right the first time.


I also mentioned projects written in Ruby and Java, but that's ok. VB.Net also has XML Literals. Ha ha.


VB's XML literals are just shortcuts for creating the corresponding classes though, right? That's quite a bit different.


Was being a bit sardonic in my comment due to where the discussion went, but yeah, XML Literals in VB.Net create XDocument instances and are just like string literals except:

* Enclosing quotes aren't required

* Assumed to be multi-line so line continuation characters aren't required

* Are validated for being well-formed XML by the compiler (and at design-time, if VS)

* Can have embedded expressions


The author of this post is a bozo; doing any (or not doing any) of the suggested things does not guarantee well-formed XML. Disregarding whole sections of the XML spec and prescribing a certain way to generate XML are more harmful than not. Can text templates generate well-formed XML? Absolutely. Can tools generate non-well-formed XML? Absolutely.


> Making mistakes with them is extremely easy and taking all cases into account is hard.

He states why right there. He doesn't say anywhere whether templates can or cannot generate well-formed xml.


(I'm the author of the article.)

Today, it's clear that text/html has won over application/xhtml+xml and JSON has won over XML for most (non-enterprise) non-document uses. But back around 2003..2009, there was no shortage of people who advocated in favor of XML and got it wrong when writing it by hand or when generating it with text-based templates.

Philip Taylor (not to be confused with Philip TAYLOR) was one of the regulars on the #whatwg IRC channel around 2007..2009. He had a hobby of trying to get XML advocates' systems to produce ill-formed output. He pretty much succeeded every time. IIRC, he even found a bug in Validator.nu's XML output, even though Validator.nu practices what I preach in the article.

The easy way was to supply user input that contained U+FFFE and watch the output blow up with the Yellow Screen of Death when U+FFFE was echoed as-is. Unless you have a templating system designed with the warts of XML in mind, this will happen. (A proper XML serializer has to scrub the characters that aren't allowed in XML, as seen in https://hg.mozilla.org/projects/htmlparser/file/dd08dec8acb7... .)
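
The same scrubbing idea, sketched in Python rather than the Java linked above (the character class follows the XML 1.0 Char production; the replacement character is my choice):

    import re

    # Allowed by XML 1.0: tab, LF, CR, and everything from U+0020 up, minus the
    # surrogates and U+FFFE/U+FFFF. Everything else must be dropped or replaced.
    _XML_ILLEGAL = re.compile(
        "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
    )

    def scrub(text, replacement="\N{REPLACEMENT CHARACTER}"):
        """Replace characters that may not appear in a well-formed XML 1.0 document."""
        return _XML_ILLEGAL.sub(replacement, text)

    print(scrub("fine \ufffe not fine \x07 bell"))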

He even found a bug in Tim Bray's code that was written to make the point that it's possible to generate XML correctly... (https://lists.w3.org/Archives/Public/www-archive/2009Mar/006...)


The sheer number of sites that produce badly formed RSS feeds is staggering. The whole point of a feed is to make your content accessible to everyone, a bit like meta tags. Why have it if you're not going to at least implement it properly?


I recently wrote a first pass at an RSS feed parser for podcasts, but couldn't find examples of interestingly malformed podcast feeds to test against. Do you have examples of sites with badly formed RSS feeds?


I had no idea there's such a beast called "XML 1.1". That sounds fun!


This reminds me of when I was just starting out as a programmer. I was doing contract work and needed to write a PHP JSON endpoint. I had no idea what I was doing and hardcoded it all with print statements. Yikes.


Why would anyone choose to use XML over JSON, other than for RSS?


I can parse/print XML (using either an in-memory parser or a streaming parser), use XML Schema to validate it, use XPath expressions to select the necessary parts, and get automatic object mapping - all with the standard library, without a single external dependency, in Java. I don't know why I would use JSON over XML unless I have very good reasons to do so.

For me, the only thing JSON does better is that it maps directly to commonly used data structures: arrays and maps.


> For me the only thing that JSON got better is that JSON is directly mapped to commonly used data structures: arrays and maps.

This is nice, but it's also kind of a pain, as it makes you stop and think about which structured data elements it's capable of supporting and for which ones you have to send your own metadata through the wire and then reconstruct on your own. For example: Dates. Which is a shame. If there is one data element I want the most help with serializing/deserializing, it's freaking Dates. All the other ones are super easy in comparison. There's just way too much subjective, dirty, human culture tied up in Dates.
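
For what it's worth, a common workaround sketched in Python - the ISO 8601 convention here is just a convention both sides must agree on, nothing the json module enforces:

    import json
    from datetime import datetime, timezone

    def encode_extra(obj):
        # Dates aren't part of JSON, so serialize them as ISO 8601 strings by convention.
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError("%r is not JSON serializable" % (obj,))

    payload = {"sent": datetime(2016, 7, 31, 12, 0, tzinfo=timezone.utc)}
    wire = json.dumps(payload, default=encode_extra)

    # The receiver has to know the convention to get a datetime back out.
    restored = datetime.fromisoformat(json.loads(wire)["sent"])
    print(wire, restored, sep="\n")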

The only thing that I think is objectively better in all cases about JSON over XML is the less verbose end-structure syntax. I think XML only has Attributes because Tags have this silly need to state their name both as they enter and exit the room.

    <Tag1 attr1="hello" attr2="world">
      <Tag2>how</Tag2>
      <Tag2>are</Tag2>
      <Tag2>you?</Tag2>
    </Tag1>
For simple, unnestable data elements, having the more efficient Attribute starts to look attractive.

If XML were more the form:

    <Tag1 attr1="hello" attr2="world">
      <Tag2>how</>
      <Tag2>are</>
      <Tag2>you?</>
    </>
It's actually only one additional required character compared to specifying the attribute value as an element instead.

    <Tag1>
      <attr1>hello</>
      <attr2>world</>
      <Tag2>how</>
      <Tag2>are</>
      <Tag2>you?</>
    </>
Heck, why stop there? Do we really need to have quite all of those angle brackets now? How about we just get rid of all the ones we can assume:

    <Tag1
      <attr1 hello>
      <attr2 world>
      <Tag2 how>
      <Tag2 are>
      <Tag2 you?>
    >
And finally, who even likes angle brackets? I've never enjoyed the dual duty they play as delimiters in XML and operators in other languages. Let's use a common set delimiter, something like square brackets or maybe parentheses.

    (Tag1
      (attr1 hello)
      (attr2 world)
      (Tag2 how)
      (Tag2 are)
      (Tag2 you?))
Now where have I seen this before?

PS: JSON that is as nearly as equivalent as I can make it is not much less verbose than original XML, and requires some level of convention to make up for the differences:

    {Tag1: { attr1: "hello", attr2: "world", children: [
      {Tag2: "how"},
      {Tag2: "are"},
      {Tag2: "you?}]}
Though I'm sure in common practice it'd have a lot of the original metadata of the XML version thrown away:

    {attr1: "hello", attr2: "world", children: [
      "how",
      "are",
      "you?"]}


The XML tag style is much, much easier to work with when you're dealing with markup. And XML's purpose is to be an Extensible Markup Language. It's way more appropriate than JSON or S-expressions for that.

(Do you prefer to write HTML documents as S-expressions?)


> The XML tag style is much, much easier to work with when you're dealing with markup.

Having explicit end tags makes it easier to produce documents that aren't well formed because their closing tags clash.

Consider a very typical sort of HTML error:

    <table>
      <tr>
        <td>
        </tr>
      </td>
    </table>
Interleaving tags is never correct, yet XML allows us to do it (and I've seen it happen a lot).

The comparable S-Expr shows how it is just plain impossible to interleave tags:

    (table
      (tr
        (td)))
You might ask yourself, "which close paren closes which list?" if the document were particularly gnarly. But if we're talking about particularly gnarly documents, then XML can be just as ambiguous. You'd be using a text editor that highlighted matching parens for you, at that point, just as much as you'd be using one that highlights matching start and end tags.


This is not correct XML; a parser will throw an error. This is not correct HTML either. The only reason this code is likely to produce good-enough output in a browser is that the browser tries really hard to produce something readable even from complete garbage.


Yes, I know it's not valid XML. If you had read my post and not just skimmed the examples, you would have seen that this was the point. XML's verbose end-tag feature makes it possible to make malformed documents in a way that is just plain impossible with S-expressions.


Such interleavings can actually be valid HTML5, in that the specification defines an algorithm for parsing that handles such "tag soup" in a reasonable way.


That's not the same thing as making interleaving valid.


What's the difference?


I've never had trouble with that.


> Do you prefer to write HTML documents as S-expressions?

Actually, yes. I use CL-WHO[1] a lot, in which one can write:

    (:html
     (:head
      (:title "Foo bar")
      (:link :rel "stylesheet" :href="baz.css"))
     (:body
      (:p "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Vestibulum ullamcorper efficitur purus, at suscipit nunc luctus vitae.")
      (:ol
       (:li "Cras vel est accumsan, malesuada leo eu, iaculis nulla.")
       (:li "Proin nec mi feugiat, posuere enim in, vehicula erat.")
       (:li "Morbi vitae purus nec neque posuere pharetra ultricies in nibh.")
       (:li "Nam maximus lectus faucibus, ullamcorper lectus aliquam, aliquam lectus."))))
Which I contend is prettier than the equivalent HTML.

[1] http://weitz.de/cl-who/


I like S-expressions too, especially when generating markup programmatically. For hand-writing HTML/XML documents, which I do quite a lot, I really enjoy the tag style because of the verbose end tags and the ease of moving blocks. It's at least nice enough to make me annoyed when people claim the tag syntax is some horrible stupid disaster compared to S-expressions or (worse) JSON.


> For hand-writing HTML/XML documents, which I do quite a lot, I really enjoy the tag style because of the verbose end tags and the ease of moving blocks.

I can't say anything about liking verbose tags, which seems to me a matter of taste, but moving around S-expression blocks is easy: C-SPC to set the mark, M-C-f to move forward one S-expression, C-w to cut the current region, navigate to where one wants it, C-y to yank the cut region.

Granted, this is using emacs, which really had better have good S-expression-editing capabilities after 40 years!


I also quite like Dylan's way of ending blocks, letting you type for example "end method do-stuff" so you can see clearly what's being ended, which is useful in a document with long sections.

And I like that XML block moving is even manageable with ed, which I actually use sometimes. Well, and vi.


Yes. With good tooling (such as Emacs), markup is much more pleasant to write in S-expressions than XML.


I'll add some more reasons to the flamebait:

- JSON doesn't have namespaces, making integration of different data-sources quite hard.

- XML allows me to do versioning within documents.

- An extremely large corpus of well-tested libraries are available.

- As opposed to JSON, XML and accompanying standards (XSLT, XML Schema, XPath, XQuery) are extremely well documented.

- XML validation, parsing and processing can happen at the same time, allowing streaming solutions. Using the XML schema, a parser can be created which is optimized for a specific stream of data.
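
A sketch of that last point using Python's lxml (the schema and document below are invented; if I remember right, iterparse accepts a schema object so validation happens while the stream is read):

    from io import BytesIO
    from lxml import etree

    schema = etree.XMLSchema(etree.XML("""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="dog">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="paw" maxOccurs="unbounded">
              <xs:complexType>
                <xs:attribute name="health" type="xs:string"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
    """))

    stream = BytesIO(b'<dog><paw health="ok"/><paw health="ok"/></dog>')

    # Validate against the schema while parsing, handling each <paw> as it completes,
    # without ever building the whole document tree in memory.
    for _, paw in etree.iterparse(stream, tag="paw", schema=schema):
        print(paw.get("health"))
        paw.clear()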

(edit: formatting)


All right, I'll bite:

>JSON doesn't have namespaces, making integration of different data-sources quite hard.

Yeah, and how often do you merge two data formats like that into one data format in a way that doesn't require massive transformations anyway?

>XML allows me to do versioning within documents

Well, that's fantastic. Because XML is designed for DOCUMENTS. But JSON is designed as a wire protocol, and a data exchange format, which is very different. You shouldn't use JSON for documents, and I very much doubt that's what OP was talking about.

>An extremely large corpus of well-tested libraries are available.

For your language. But XML is fairly complex, and there are a lot of environments with no support, and JSON parsing is so simple and easy that the complete grammar, as well as the semantics, are on the front page of the website, and it's unlikely your language doesn't have support for it already.

>As opposed to JSON, XML and accompanying standards (XSLT, XML Schema, XPath, XQuery) are extremely well documented.

The grammar and semantics are on the front page of the site. And JSON is simple enough there's little more to it than that.

>XML validation, parsing and processing can happen at the same time, allowing streaming solutions. Using the XML schema, a parser can be created which is optimized for a specific stream of data.

Really? That's actually kinda cool. :-)


Alright, in the flamebait fashion, I'll bite back :)

> Yeah, and how often do you merge two data formats like that into one data format in a way that doesn't require massive transformations anyway?

Actually, quite a lot in the past! Back in 2009 I did some XProc pipelining of messages. These pipelines were a bit like reactive streams, which were (mostly) agnostic of the contents. This allowed me to combine, dissect and route streams of data in an intuitive way. Maybe you can compare it with mapping over a collection: you don't care what's inside, but you want to preserve the contents. XProc was kind of a functional programming + reactive approach to data processing. Pretty cool and ahead of its time, if you ask me.

> But JSON is designed as a wire protocol

Reference, please? Even if I subscribe to one definition of 'wire protocol' on the internet (there are many), I don't think it creates a meaningful distinction between XML and JSON.

> JSON parsing is so simple

Actually, it is, and it isn't. Yes, there are very few primitives (strings, booleans, numbers, arrays, objects), but this also causes important limitations. For example, it is rather cumbersome and unspecified to transfer binary data in a JSON document (base64 encoding). Another thing: how easy is it to parse a streaming JSON document in Javascript?

> The grammar and semantics are on the front page of the site.

Admittedly, that's a lot easier than, say: https://www.w3.org/TR/xml11/ These guys really took it too far...

> That's actually kinda cool.

That's what I thought too when I first heard about it :)


Okay, back to me:

>Actually, quite a lot in the past! Back in 2009 I did some XProc pipelining of messages. These pipelines were a bit like reactive streams, which were (mostly) agnostic of the contents. This allowed me to combine, dissect and route streams of data in an intuitive way.

Huh. So like this:

  |xmlstream|->|transformer|->|xmlstream|
Pretty slick. So the namespacing allowed you to add new tags without worrying about tripping over the old ones? Cool, but the types of transforms you can do without knowing the internals of the XML you're transforming are fairly limited, and because JSON's objects don't mandate an app-wide meaning for a key - the closest thing JSON has to XML tags - you can just attach the new data to a new dict, and the problem solves itself. If you're merging objects, and each gives a different value for a key, then you can set up either an array or an object to hold both, or just send along both objects, wrapped in an array/object like before: in essence, by JSON's semantics, each object is its own namespace.

>Reference, please? Even if I subscribe to one definition of 'wire protocol' on the internet (there are many), I don't think it creates a meaningful distinction between XML and JSON.

References, I can give. json.org, first paragraph:

  JSON (JavaScript Object Notation) is a lightweight data-interchange format. 
I apologize for being unclear: Data Interchange format is what I meant.

XML was not intended to be a generic data-interchange format: it, like HTML, SGML, and GML before it, was designed for DOCUMENT markup: human-readable, structured, semantic DOCUMENTS. It has since been pressed into service as a data-interchange format, and it's a testament to how well it was designed that it works as well as it does for that, but its verbosity and general format and layout make it ill-suited to the purpose. JSON was designed for data interchange: I said wire protocol, as data interchange is often about sending data between applications on a network, which is what a wire protocol is for.

Hopefully some of that answers your question.

>Actually, it is, and it isn't. Yes, there are very few primitives (strings, booleans, numbers, arrays, objects), but this also causes important limitations. For example, it is rather cumbersome and unspecified to transfer binary data in a JSON document (base64 encoding). Another thing: how easy is it to parse a streaming JSON document in Javascript?

Is it specced to transfer binary data in XML? First I'd heard of it. Base64, uuencode, hex, or raw numbers - there are plenty of ways to encode binary data in JSON, and if you're using any system that has reserved characters (like CDATA in XML, if that's what you're thinking of), then you have to do this sort of encoding somehow. Besides, you could always send the JSON as a header, and have the app get the binary data from a different endpoint. Although you may want Base64 to avoid the roundtrips...

As for parsing streaming JSON, I don't know if there are any libraries for it, but the implementation should be very simple: like XML, JSON is a tree, so parser state can be represented as a stack: you see a {, you're now in an object. A [, you're in an array. A , indicates adding a new value to the current array, or a new k/v pair to the current dict. What each character means is deterministic, given what came before, so you can construct JSON as the data comes in, and provide access to each value as it becomes ready. Although, given most JS implementations' multithreading limitations, all this really does is ensure that you don't have to have the entirety of the data in memory before you start parsing. Which is a good idea...
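
For what it's worth, that stack-of-events approach is what incremental JSON parsers already do outside the browser; a small Python illustration with the third-party ijson package (the sample document is made up):

    import io
    import ijson

    # ijson walks the document as a stream of events, yielding each complete value
    # as soon as it has been parsed, instead of materialising the whole tree first.
    stream = io.BytesIO(b'{"users": [{"id": "abc"}, {"id": "def"}]}')
    for user in ijson.items(stream, "users.item"):
        print(user["id"])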


Regarding the XProc pipelining: it's been some time, but I recall various possible transforming and matching steps, such as a conditional stream, transformers, reduction steps over multiple streams and others. This could then be combined with XPath, XQuery, XSLT and even SOAP requests. The problem of XProc was similar to the rest of the XML era: it required too much of the implementer to understand. Also, good programmer tooling, such as graph editors for the pipelines, was missing. Perhaps this is slightly similar to functional programming and category theory nowadays. The ideas are sound, but they require too much study for too little profit. Also, to properly work with category theory in programming, it would be very nice to have some graphical tools to view the transformations applied to your code.

Now for the flamebait-y parts. Yes, XML has loads of archaic SGML syntax bits, DTD built in (hopeless for small parsers), and the attribute/sub-element divide has never been completely solved. But it can be argued that JSON had a similar fate: it literally descended as a subset of ECMAScript. This also explains the lack of separation between integers and floats.

But I agree, XML is definitely not the best data-interchange format, but neither is JSON. Some LISPy syntax would be my preference for data-interchange if it needs to be readable. But I'm trying to argue that this doesn't really matter. The XML era is mostly over, and I'd say we should try to learn from 'the good parts'.

I conflated XML with the XML Schema datatypes, and I shouldn't have, but it has been some time since I last seriously worked with XML. Also, we should consider the whole ecosystem, not just structure. XML Schema actually does spec binary data (http://www.datypic.com/sc/xsd/t-xsd_hexBinary.html).

With regard to round-trips: as a general rule it might not hold if you ask me. It implies an origin and even state! Maybe the sender cannot cache your binary data for a round-trip (memory, legal, latency, security constraints all play a role here). Personally, I like RESTful for simple systems, but for more involved architectures, message passing is much more scalable and easier to distribute.

> As for parsing streaming JSON, I don't know if there are any libraries for it, but the implementation should be very simple ...

Yes, a novice programmer should be able to write it in an hour. But the weird thing is, of all our libraries and all our frameworks (browser-side), none of them do streaming. Ok, I guess we should use websockets with JSON-encoded events for this, but still.

But hey, there are so many metrics with which one can evaluate a data-interchange format. (Recently we did a survey of binary data-interchange formats and found around 25 different criteria... and we were not really being thorough).


Okay. Cool.

I wasn't just trying to flame when I wrote that: talking to people who have different ideas and use different stacks is a good idea, and it teaches you things you didn't know before. And learning about stuff is why I use HN in the first place :-).

>The problem of XProc was similar to the rest of the XML era: it required too much of the implementer to understand. Also, good programmer tooling, such as graph editors for the pipelines were missing.

I will have to look up XProc now, because the things you've been saying sound really interesting, and it's clear I don't really get it.

>But I agree, XML is definitely not the best data-interchange format, but neither is JSON. Some LISPy syntax would be my preference for data-interchange if it needs to be readable.

I suppose. I love lisp considerably more than the next guy, but lisp structures technically only specify linked lists, which are O(n) for all data retrieval. This is also how most implementations implement them. Also, JSON is similar, and trivial to convert to that format:

  ["like", {"key":"this"}]
  =>("like" (("key" . "this")))
Although you would idiomatically use symbols in many places where JSON uses strings.

>XML Schema actually does spec binary data

Once again my inexperience with xml shows. Thanks for letting me know.

>With regard to round-trips: as a general rule, I don't think it holds. It implies an origin and even state![...] Personally, I like RESTful for simple systems, but for more involved architectures, message passing is much more scalable and easier to distribute.

Firstly, I'm pretty sure REST implies a message-passing architecture. Correct me if I'm wrong.

Secondly, the round-trip idea sucks for a number of reasons, but I don't think it has to imply either. Let's say you have an endpoint at example.com/<userid>/lastmessage, which might give you the last message the user sent. If the user sent "just ate at Joes, #delicious," you might receive:

  {"message-type":"text", "message":"just ate at Joes, #delicious"}
But if the user sends an image, it's uploaded to the server, and you have to get it down. So you would instead get:

  {"message-type":"image", "message":"X57pqr32"}
And you would ask for example.com/<userid>/static/X57pqr32.
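Something like this hypothetical client (same invented endpoints as above) would cover both cases:

  // Sketch: fetch the last message, and only make the second round-trip
  // when the payload is an opaque image id rather than inline text.
  async function lastMessage(userId: string): Promise<string | Blob> {
    const res = await fetch(`https://example.com/${userId}/lastmessage`);
    const msg = await res.json();
    if (msg["message-type"] === "text") return msg.message;
    const img = await fetch(`https://example.com/${userId}/static/${msg.message}`);
    return img.blob();
  }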

I don't know, but I think that would work.

>But the weird thing is, of all our libraries and all our frameworks (browser-side), none of them do streaming.

And I actually know why this is: before the advent of Websockets, the only options were XHR or awful hacks (JSONP should chill the blood of any security expert). Neither supported reading incrementally, AFAIK, so there was no point. Now that Websockets are a thing, it shouldn't be long coming. Now all we need to do is build something to put JSONP in the ground...

>But hey, there are so many metrics with which one can evaluate a data-interchange format. (Recently we did a survey of binary data-interchange formats and found around 25 different criteria... and we were not really being thorough).

Indeed. By the way, did you look at Cap'n Proto and MessagePack? Neither are really on the fringe, but they look interesting, and they seem to have some decent support.


By the way, a good example of the multiple-trip RESTful API I described is XKCD's JSON API (http://xkcd.com/info.0.json)


Yeps, that's basically HATEOAS :)


> Firstly, I'm pretty sure REST implies a message-passing architecture. Correct me if I'm wrong.

It absolutely is, but afaik (correct me if I'm wrong here) it implies an origin. It relies completely on addressable and available resources. It relies on exactly-once semantics (POST) and round-trips. Message passing for me is more like the actor model: ephemeral information, at most-once delivery, references to computers (actors), not data, and most importantly: the message is central, not the endpoint.

Perhaps I'm understanding all of this completely wrong, I'm honest here, but the actor model for me means 'message passing orientation' and RESTful to me means 'resource orientation'.

> And you would ask for example.com/<userid>/static/X57pqr32.

I implemented more or less the same scheme in a message-centric application for crypto. Larger objects such as photos and videos were encrypted and placed in central storage (a later design phase included a DHT implementation). The receiver could decrypt the message at a later time, whenever the photo was visible in the app/webpage. The central server, however, was none the wiser, as all data was encrypted and without semantic information. Here it is interesting to note that, even though we use references (URIs), the resource is not identifiable, except by its SHA hash. There was no sense in saying https://kanta-messenger.com/photos/1234abcd since there is no knowledge of 'photo' or 'video'. However, there is still representational state transfer (REST) going on, without any of the semantics.
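The addressing part is tiny; roughly this (a sketch with an invented URL layout, the real system obviously did more):

  // Content addressing: the blob's "name" is the hex SHA-256 of its
  // already-encrypted bytes, so the server stores it without learning anything.
  async function blobUrl(encrypted: ArrayBuffer): Promise<string> {
    const digest = await crypto.subtle.digest("SHA-256", encrypted);
    const hex = Array.from(new Uint8Array(digest))
      .map(b => b.toString(16).padStart(2, "0"))
      .join("");
    return `https://kanta-messenger.com/blobs/${hex}`;
  }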

> Now all we need to do is build something to put JSONP in the ground...

Agreed

> By the way, did you look at Cap'n Proto and MessagePack?

We had two phases (since it takes quite a lot of time to research each data-exchange protocol). In the first phase, we evaluated on a couple of core criteria: language support (Scala, Java, Python), no long-standing GitHub issues, more than one core committer. We reduced that to three protocols: protobuf, FlatBuffers and Apache Avro. Much to our surprise, the last one won. Why? Various reasons, one of them being the possibility to do reflection and search within encoded messages for which the receiver does not have a schema. For example, you might want to create a router which only routes messages that contain a certain header. Another is archiving: since the schema is always included, it is possible to decode messages years after they have been stored somewhere. A third one is forward- and backward-compatibility. All of them were close wins (4 vs. 5 stars), but it brought us to Apache Avro. Looking back on that decision, it was a good one. Many within the company are happy with the choice.
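For readers who haven't seen it: an Avro schema is itself just JSON, something like this (an illustrative record, not our actual schema):

  {"type": "record", "name": "Event", "namespace": "com.example",
   "fields": [
     {"name": "id",   "type": "long"},
     {"name": "kind", "type": "string"},
     {"name": "note", "type": ["null", "string"], "default": null}
   ]}

Because the writer's schema travels with the data (or can be resolved for it), a reader with an older or newer version of the record can still decode it, provided new fields have defaults - which is the forward/backward compatibility mentioned above.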


>All of them were close wins (4 vs. 5 stars), but it brought us to Apache Avro. Looking back on that decision, it was a good one. Many within the company are happy with the choice.

Neat, I may check it out.

>It absolutely is, but afaik (correct me if I'm wrong here) it implies an origin. It relies completely on addressable and available resources. It relies on exactly-once semantics (POST) and round-trips. Message passing for me is more like the actor model: ephemeral information, at most-once delivery, references to computers (actors), not data, and most importantly: the message is central, not the endpoint.

I mean, that IS a valid way to think about it. I think about it like this:

when you're using a REST API, you are sending a message to an application. That application is identified in part by your endpoint: the server, and the path to the app. The rest (params, method, remaining path) is your message. Some applications map the messages you send them onto a sort of virtual filesystem, which may or may not correspond to a real one. This appears in webservers, and many APIs. For these, the messages you send primarily consist of paths. Others treat their messages more as procedure calls, and use more params. Both are messages, just as sure as

  cat /proc/sys/net/ipv4/ip_forward
and

  sysctl net.inet.ip.forwarding
even though one uses a filesystem model, and the other uses a command.

But your model of REST, while less linked to message passing, has much less cognitive load.

There's something wrong with me.

Actually, it's funny we're discussing message passing, because I've been working on an app that uses message passing between pre-emptive co-routines, and kinda-sorta unidirectional data flow heavily. Of course, at 2 coroutines per connection, it won't scale. Thankfully, it won't have to.

I hope.


> - JSON doesn't have namespaces, making integration of different data-sources quite hard.

The only reason anyone would ever say that is ... that they have used XML (or SGML) and are subscribed to that mindset.

Every toplevel JSON document is a valid value in any other JSON document. That's how easy it is to integrate. The only reason that XML/SGML needs namespaces in the first place is that the schema dictates what an element, e.g. <block>, can have as children or attributes, and how many -- and as a result, <block><statement/></block> from a programming language and <block><cityblock/></block> from a city design schema cannot be mixed (neither would be valid in the other's schema). So you have to use <code:block><code:statement/></code:block> and <city:block><city:cityblock/></city:block> to differentiate them.
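(Those prefixes are just shorthand for URIs declared on an ancestor element, e.g. something like the following, with invented URIs; the URI, not the prefix, is what actually distinguishes the two vocabularies.)

  <doc xmlns:code="http://example.org/ns/code"
       xmlns:city="http://example.org/ns/city">
    <code:block><code:statement/></code:block>
    <city:block><city:cityblock/></city:block>
  </doc>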

> - XML allows me to do versioning within documents.

What kind of versioning are you referring to? Schema versioning? Data versioning?


I've never really subscribed to the mindset of XML, for it has many disadvantages (the verbosity, the complexity of DTD, the tendency for documents which are too large). However, I do subscribe to namespaces, since it allows global referencing of names. I also do subscribe to formal grammars, mature standards, and good documentation. FYI, I've designed streaming-JSON-based secure messaging systems, did binary-only schemas for speed, and for simple tasks I just implement JSON+REST, since everyone nowadays comes to expect it. It's just that I think XML got an undeserved bad reputation and many of the 'good parts' have been forgotten.

With regard to namespaces, when designing standards, it is very useful to separate one 'person' definition from another, since they might not have the same semantics. This allows us to connect, say, 'com.facebook:Person' with 'com.google:Person' with a global equivalence relation. It allows us to specify bridges between standards.

I don't really subscribe to the 'it is a valid value in another JSON document'. It is only valid when it can be interpreted by a receiving program (otherwise it is data, not information). The namespaces are not there for validation (alone), they are there for interpretation.

With versioning, I meant schema versioning. Admittedly, not a great solution, but at least it allows a receiving party to know which parts can safely be interpreted.


> I do subscribe to namespaces, since it allows global referencing of names

I have been forced to use XML one way or another for a variety of uses (mostly integration, not document storage - but still), and have not ONCE had a use for namespaces or multiple DTDs in a single document. I suspect no one has statistics, but I wouldn't be surprised if this is true for 99.9% of {users,documents,systems} - which, if true, means that 1/1000 burdens the rest needlessly. But of course, this is mere speculation.

> I also do subscribe to formal grammars, mature standards, and good documentation.

XML is enticing by appearing to have those, but it actually doesn't, as Naggum articulated in [0]. XML schemas can describe a superficial structure, but not anything non-trivial and definitely not any semantics. Naggum is entertaining though he holds nothing back, see e.g. [1].

> It's just that I think XML got an undeserved bad reputation and many of the 'good parts' have been forgotten.

The problem with XML is that, as with lawyers, 95% of it gives the rest an undeserved bad reputation. XML did have some good ideas, but they are almost nowhere to be found in practice.

> This allows us to connect, say, 'com.facebook:Person' with 'com.google:Person' with a global equivalence relation. It allows us to specify bridges between standards.

No it doesn't, unless they are semantically equivalent - which they never are. They might be superficially similar, with some translation possible using (e.g.) XSLT. But if, for example, com.google:Person has <DisplayName> and no first/middle/last, and com.facebook:Person has <FirstName>, <MiddleName> and <LastName> (but no display name), then XSLT can only translate one way, and nothing can translate the other way without error. It's nice in theory, but - projecting from my experience which is long and across many industries, but obviously still anecdotal - in practice, the semantic differences always require logic beyond XSLT, and thus the namespaces are only of aesthetic value if any.
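For instance, the easy direction is a few lines of XSLT (namespace URIs invented here); the reverse, splitting a display name back into first/middle/last, is exactly the part no stylesheet can get right:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:fb="http://example.org/facebook"
      xmlns:g="http://example.org/google">
    <xsl:template match="fb:Person">
      <g:Person>
        <g:DisplayName>
          <xsl:value-of select="concat(fb:FirstName, ' ', fb:LastName)"/>
        </g:DisplayName>
      </g:Person>
    </xsl:template>
  </xsl:stylesheet>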

> It is only valid when it can be interpreted by a receiving program (otherwise it is data, not information)

True. How is that different from XML or anything else? The same statement applies to XML, namespaces or not. If the program doesn't know what it is interpreting, the namespaces do not matter. If it does know, they don't matter either. Sure, it's a way to mark the source through _all_ elements, but since the program must be aware anyway, you can just as well enclose your Person object with {Facebook: {first:'John', last:'Smith'}} or {Google: {display:'John Smith'}}. Yes, XML has a standard way of doing that - but in practice my experience is that it costs about 1000 times what it provides.

> With versioning, I meant schema versioning. Admittedly, not a great solution, but at least it allows a receiving party to know which parts can safely be interpreted.

And what if semantically, the parts you don't know about make interpretation moot? Practically, if it's a version you don't know, you shouldn't try to interpret it. And that's achieved by a simple 'version' field in JSON. The standard way of doing this buys practically nothing - 99% of XML files out there do not declare or properly follow a DTD.

[0] http://www.xach.com/naggum/articles/3224504693262432@naggum.... [1] http://www.schnada.de/grapt/eriknaggum-xmlrant.html


> and have not ONCE had a use for namespaces or multiple DTDs in a single document.

I'm actually rather surprised about that. Take an XSLT document and you're bound to use multiple namespaces. Have you never used an editor which provides tab-completion, quick validation and documentation of tags on the fly? The systems I worked with heavily relied on namespaces for validation, exploration, versioning and prevention of naming clashes. These, however, were heavily distributed systems within government organisations.

Also, please note I'm talking about namespaces within and outside XML. I'm saying that namespaces are a cheap and easy to implement design rule.

Ok, now your references.

[0] is actually an argument for namespaces (and XML schema or suchlike). If I understand correctly, he proposes a system which allows you to specify part of an XML document post-hoc (using a namespace which references a schema which is specific to the module-writer).

The second one, I must admit, had a low signal-to-noise ratio for me. The writing seems to refer only to XML and DTD, and says nothing about the larger ecosystem, which my arguments were about. Anyway, remove all the banter and you're left with a couple of arguments:

1. the syntax is verbose (yes it is, nobody disagrees, not even the designers).

2. there is no macro support (perhaps useful, one could embed an XSLT stylesheet if necessary). I find this a minor point. It would also severely complicate the parsers and make them stateful and memory-bound.

3. binary representation (in line with 1). How many good, portable binary structured editors do you know? How much does the size of an XML document shrink with simple compression? (hint: quite a lot). Also, when going to binary, there are many other design choices, such as: should it be possible to memory-map the document so the CPU is not involved? Should pointers be employed so we can skip sections of the document? Should we use names, ids, UUIDs? Do we optimize for processing use, network use, memory use? [1] seems to only argue about network bandwidth (which, for most applications, is abundant).

The rest of the document (I have to admit, I skimmed some parts), appears to be a rant on everything and everyone stupid. The king who shouts: "I am the king!", is no true king.

<skipping some parts>

> If it does know [the namespaces], they don't matter either.

To structure something, we first need to construct it, i.e. bring things together, and later we need to deconstruct it. In both cases, it helps to have namespaces because (de)constructing might involve many different distributed parties, with different versions of software. It is my opinion that in truly distributed systems, naming and typing are of utmost importance.

It's getting quite late here. Thanks for making me think about this subject again. Sadly I cannot answer all of your points within a reasonable time.


XPath expressions are actually pretty cool (and I say this as not a great fan of XML in general). The ability to search and select elements is something we use all the time. For many examples, search for "xpath_" in https://github.com/libguestfs/libguestfs/blob/master/v2v/inp...
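For anyone who hasn't used it: an XPath expression reads like a filesystem path with predicates, e.g. (an illustrative expression, not one from that code)

  /catalog/book[price > 10]/title

selects the title of every book element whose price exceeds 10.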


I'd drink to that, XPath is incredibly useful and easily mastered.

I'm also fond of XSD and XSLT myself. They can be obtuse at times, but have been indispensable in the use cases that I've needed them for.


If you're actually marking up text, JSON doesn't really work. XML works alright for its intended use as a language. (That doesn't mean I hate it any less.)


I think if you're marking up text, you really want to be using markdown or one of the other wiki-like formats. For writing, you're much more likely to run into non-technical people who will be severely impacted by syntax errors.


If you are working with statically typed languages, the validation of XML is far superior to anything you can do over JSON, unless you want to write your own formats for defining data structures in JSON.

JSON, remember, was written for a language without even firm object structures. It is great in that environment, but all exchange formats require external knowledge to validate, and XML, unlike the others, provides a way to do that.
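For example, even a schema this small (illustrative) is enough both for validators to reject malformed documents and for tools like Java's xjc or .NET's xsd.exe to generate typed classes:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="age"  type="xs:nonNegativeInteger"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>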


For one, JSON didn't exist 15 years ago.

For another, JSON didn't have validation or schemas 5 years ago.


Even there, the schemas and validation are very lightweight compared to what XML can do.

As I usually say, JSON is for relatively free-form, dynamically typed languages, but if one side uses a statically typed language, XML is probably the better choice.


Hmm... I always tend to use XML for B2B communication or Mine-to-Theirs type RPCs. I use JSON primarily for Client-to-Server communication internally with our own applications or for public APIs.


JSON was defined in April 2001, basically a subset of JavaScript specs from 1999.

So, JSON did exist 15 years ago, but not 16 years ago, although you have to go back 20 years if you want your statement to not just be about the name.

And yet ... how many of the decisions to use XML go back those 15 years? Hardly any.


JSON may have been created 15 years ago but it wasn't well known or commonly used for a number of years. Yahoo! only started using it in 2005 and Google in 2006. XML had been around and in use for years prior to that and even today has a much richer toolchain.


Because it's preferred or required by a well-paying customer. With any sane web framework you get both JSON and XML out of the box. If the well-paying customer doesn't send an XML Accept header, you make it the default response type and tell everyone else to send a JSON Accept header. If there's a conflict between multiple well-paying customers, you make new endpoints.


Working in a statically typed language, and using schemas to ensure that the generated messages are correct, because we can generate classes that map onto the required data structures.

So far as I know JSON doesn't allow for that.


JSON is nice for structured data, but when it comes to more mixed, document-style data, XML is preferable IMO.


Most of the advice applies equally to generating JSON data as well.


xaml for UI.. FML


(2005)


Yes. Added.


Some of this (avoiding pretty printing, mainly) is just dealing with XML's insanity. The rest is pretty solid advice, but fairly obvious for the most part. Then again, I've already done a lot of Scheme programming, understand common sense, and I read Steve Yegge's The Emacs Problem, so I already looked at XML as a tree structure, and crawling a pre-existing tree to turn it into XML is just the most natural way to deal with XML in Lisps.



