Invisible XML is a language for describing the implicit structure of data (invisiblexml.org)
73 points by bryanrasmussen on July 16, 2022 | hide | past | favorite | 37 comments



Hmm, on the one hand this proposal makes XML come full-circle and re-introduces SGML concepts (eg. SHORTREF) that were explicitly omitted from XML as a simplified SGML subset for canonical angle-bracket markup without the need for markup declarations; OTOH, Norm and Steve are fully aware of SGML. I'd really appreciate it if whoever wants to re-introduce SGML features to XML would justify and align their proposal with SGML, just as XML was introduced as a proper, well-aligned subset of SGML.


Now you can export your SQL dump to YAML, parse it to XML, convert it to JSON, put it in NoSQL, messagepack it into a key-value store and finally be able to query your customer names!


If only things were that easy!


This looks like a parser generator that makes an object tree in the form of an XML document.


I was going to say the same: it looks like an (E)BNF-to-AST parser which outputs XML.

My only quandary would be whether the output XML structure could be ambiguous given the parse tree and input (requiring lots of context-dependent if/then logic when interpreting the XML). Perhaps some kind of invisible XML stylesheet could pre-process the AST before outputting the XML.

And secondly, can it handle CSV? If so, along with a command-line app like `jq` it could be an extremely useful addition to the general-purpose data munging toolkit. Or do I have to pass the input through a 1000+ byte `sed` script first to normalize it?
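For plain CSV, a small stdlib script already covers the normalization step without any `sed` pass. A minimal Python sketch (not ixml itself; it assumes the header row contains valid XML element names):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input; in practice this would come from a file.
data = "name,age\nAnn,41\nBob,32\n"

root = ET.Element("rows")
for record in csv.DictReader(io.StringIO(data)):
    row = ET.SubElement(root, "row")
    for column, value in record.items():
        # Column headers become element names, so they must be
        # valid XML names -- a real tool would sanitize them.
        ET.SubElement(row, column).text = value

print(ET.tostring(root, encoding="unicode"))
```

From there, the usual XML toolchain (XPath, XSLT) can take over.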


I was also about to say the same.

I was literally just working on a project with TatSu[1] in Python, which contains language elements that are conditional on patterns in the syntax. I found that the regex matching added to the EBNF-like syntax was quite powerful.

My only issue was that the generated parser appeared to parse the entire content into memory rather than stream parsing.

I haven’t read enough on ixml yet, but while it seems like it would unlock many existing toolkits (I never hated XML but moved away from it as the industry did), it seems like parsing through an XML format should be done in a stream-mapping fashion, without persisting data, for the XML to be truly “invisible”.

Adding the overhead of XML back into the processing chain… hmm, I honestly have to think about whether the value of accessing the data and the XML toolchain is worth it.

I’d almost rather see something that can read any input as a stream with a grammar and produce a stream (that can be materialized) in a more optimized, yet open, format that can be compressed but handle complex types… like protobufs, or flatbuffers. I’m not sure that humans need to read raw data files so long as we have great and open tooling to view data in binary formats (iirc, verbosity and the added overhead were the biggest arguments against XML originally).

[1] https://tatsu.readthedocs.io/en/stable/index.html


+1 XML is to data what UML is to software


> Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content.

Interesting, although this seems a little out of date because XML seems to be one of the least desirable data formats for modern programming languages. Maybe this is more useful for legacy enterprise use cases that I don’t know about.


The fact that it's XML is not super relevant -- what you're working with in code is a tree structure that may as well be JSON or YAML.

This is essentially a way to write grammars for things and get the ability to parse them as trees in a common format that is interchangeable with things like JSON.

I honestly don't see much of a difference between this, and something like a PEG grammar where you do this:

  let parsed = peg.parse(input, grammar)
  let xml = json2xml(parsed)


The popularity of the target isn't really as relevant here as the capability of the target. XML supports annotated trees (attributes + child nodes) whereas most popular modern interchange formats only support one of those dimensions (child nodes). Some of them do support types that XML lacks (integers, null, etc.) but these can be annotated in XML so the lack isn't critical.

Ultimately what all that means is that all e.g. json documents can be represented losslessly in xml, whereas the reverse is not true without explicit external schema. Which means targeting XML covers other less capable interchange formats implicitly.
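As a sketch of that point: any JSON value can be mapped onto XML elements with a `type` attribute recording what would otherwise be lost. A minimal Python example (assuming object keys happen to be valid XML element names; a real mapping would escape them):

```python
import json
import xml.etree.ElementTree as ET

def json_to_xml(name, value):
    """Encode any JSON value as an XML element, keeping the
    original JSON type in an attribute so nothing is lost."""
    el = ET.Element(name)
    if isinstance(value, dict):
        el.set("type", "object")
        for k, v in value.items():
            el.append(json_to_xml(k, v))
    elif isinstance(value, list):
        el.set("type", "array")
        for v in value:
            el.append(json_to_xml("item", v))
    elif value is None:
        el.set("type", "null")
    elif isinstance(value, bool):  # must come before the int check
        el.set("type", "boolean")
        el.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        el.set("type", "number")
        el.text = json.dumps(value)
    else:
        el.set("type", "string")
        el.text = value
    return el

doc = json_to_xml("root", {"name": "Ann", "age": 41, "tags": ["a", "b"]})
print(ET.tostring(doc, encoding="unicode"))
```

Going the other way (arbitrary XML to JSON) is where the external schema becomes necessary, since attributes, mixed content and element order have no native JSON home.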


XML still has a few advantages over JSON: comments, namespaces, a canonical form, and slightly better extensibility.


JSON either has those (namespaces, extensibility) or the advantage is slight at best (comments, canonical form).

There is a real advantage of XML over JSON not mentioned though, which is its usefulness in annotating computer-readable data into an otherwise human-editable document. There's not a lot of these cases, and where they're at you're probably still better off using AsciiDoc, Markdown or even HTML instead, but those use cases are out there and JSON is awful for those.


I'd add that another advantage of XML is that you get a powerful set of tools that are readily accessible across all "working" programming languages, often provided by first party libraries. Plus a lot of developers (and text editors) know the basic syntax.

Granted, JSON has achieved this same level of universality, but everything else (excepting perhaps CSV files) either suffers from obscurity or from weak/ambiguous/competing specifications.

The dream of semantically rich documents that XML provides (like, say, being able to cleanly interweave MathML, SVG, and XHTML in one document) is unmatched.

I think we'd see a lot more XML usage if it hasn't been over-promised, over-delivered (WS-*), and over-used (enterprise Java). If XML had stuck to its lane (making a schema language like RelaxNG instead of XmlSchema, for one thing) it wouldn't have left such a bad taste in so many people's mouths.


Having worked in the lexicographic space, XML is a format loved in that space because in the simplest forms it's understandable by non-technical people.


> the advantage is slight at best (comments...)

I suppose that's true if the data is generated by machine and consumed by machine, and never edited or looked at by humans. But JSON seems to be used in a lot of datasets that humans do touch, like config files that you can edit. Visual Studio Code is one such user, and I've had to edit my own VSC config files. And I use comments, because http://catb.org/~esr/writings/unix-koans/prodigy.html. Fortunately, VSC allows comments between a // and a following newline.


> Fortunately, VSC allows comments between a // and a following newline.

Yes, that's my point. If a human is required to look at the JSON it's not hard to find a parser that permits comments. So of course it would be supported if it was actually needed.

Even without that though, you can do something like the trick I used in a JSON-based format:

  { "__comment": "For help with this file, go to http://wiki.example.com/...",
    ...
    "foo": "..." }

That does have the problem that your programs processing the JSON need to strip out keys with that name if it would interfere with the program. So it's probably easier just to add the line of code in your parser saying to permit comments.
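A minimal Python sketch of that stripping step (the key name `__comment` is just the convention from the example above):

```python
import json

def strip_comments(value):
    """Recursively drop any "__comment" keys so downstream
    code never sees them."""
    if isinstance(value, dict):
        return {k: strip_comments(v) for k, v in value.items()
                if k != "__comment"}
    if isinstance(value, list):
        return [strip_comments(v) for v in value]
    return value

raw = json.loads('{"__comment": "see the wiki", "foo": "bar"}')
print(strip_comments(raw))  # {'foo': 'bar'}
```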

Meanwhile, XML comments have issues of their own: you have to be familiar with the SGML rules on how double hyphens (--) are handled to safely use or edit them. So while they are nice to have, it's not as if they have no nuance either.


And also data query & transformation languages like XSLT and XQuery that are designed around its data model.


Exactly. XPath is great. Also, being able to define as many types of closing tags as you want makes it much easier to visually parse what is going on in a complex data structure. That alone has a ton of value. I am still struggling to find the right mix of highlighters and tools to work in various formats/languages, due to the opacity of closing structures like }]}]];.

Now, there is a trade-off in more complex paths to interoperability with Uncle Joe and his DTD. It is still easier than trying to parse some convoluted JSON coming from yet another API that the dev team never dogfooded a day in their life, because you can jump levels so precisely in XPath thanks to those verbose types and attributes. So as to XML verbosity, which for many appears to be the main complaint: for me it is worth it when done right.
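For illustration, even the limited XPath subset in Python's stdlib lets you jump straight to a node by tag and attribute, regardless of nesting depth:

```python
import xml.etree.ElementTree as ET

# A made-up document standing in for "some convoluted API output".
doc = ET.fromstring("""
<orders>
  <order id="1"><customer>Ann</customer></order>
  <order id="2"><customer>Bob</customer></order>
</orders>
""")

# .// descends to any depth; the predicate selects by attribute.
name = doc.find(".//order[@id='2']/customer").text
print(name)  # Bob
```

A full XPath 1.0+ engine (libxml2, lxml, a DOM implementation) gives far more than this subset, but the addressing idea is the same.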


I agree. User mcswell mentioned JSON config files in another comment, and I always feel somewhat uneasy with them (same but worse with YAML), while the redundancy in XML makes it nicer to work with, in my opinion. I'm very glad .NET Core switched back to the XML-based MSBuild format instead of sticking with their new project.json stuff.


Surely you could lay out a canonical form for JSON in like five minutes?

Oh, here: https://www.rfc-editor.org/rfc/rfc8785

No duplicates, no whitespace, sort the keys, copy number serialization from javascript, a couple other little details.
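In Python, a rough approximation of those rules is a one-liner. Note that `json.dumps` does not implement RFC 8785's exact number serialization or its UTF-16 key ordering, so this is only a sketch of the idea, not a conformant JCS implementation:

```python
import json

def canonicalize(obj):
    # Sorted keys, no insignificant whitespace. RFC 8785 additionally
    # pins down number formatting and sorts keys by UTF-16 code units,
    # which json.dumps only approximates for ASCII keys.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)

a = canonicalize({"b": 1, "a": [2, 3]})
b = canonicalize({"a": [2, 3], "b": 1})
assert a == b  # equal objects serialize to identical bytes
print(a)  # {"a":[2,3],"b":1}
```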


> Published: June 2020

This is only two years old, while JSON has been in common usage for more than a decade... Without a spec there are a dozen ways to achieve it, for example by sorting keys using UTF-8 instead of UTF-16 values as done in this document, and the slightest difference would break things when used with crypto.


If you're doing something cryptographic you need a spec for that anyway, so the extra work to specifying something like json sort order doesn't seem like it's a major factor to me.


There's a spec for that already for XML: Signed XML. And the tooling that supports it, for instance on the .NET platform. So no need to create yet another spec and tools.


Cool, then that spec is more useful than the one I was replying about.


At the expense of excessive verbosity. I used xml and still do to this day and it is one of my least favorite formats to read and edit.


There are XML workflows in technical documentation outside of software development where something like this could be of interest.


I think one of the interesting points missed here is that if you can convert to XML, then you can use the large selection of mature tooling around XML to query, transform, and process your data.


So, this is a grammar format specification, plus a specification of how ASTs parsed using such grammars are output (serialized) as XML, and some requirements on “processors” doing the parsing and serialization if they claim conformance to the ixml specification.


I see you’re trying to sneak XML into common use again. Haven’t we already suffered enough?


I think the kids who never really used XML might find it quaint and charming.


I suppose you think JSON is better.


The MIME type handler on the https://invisiblexml.org/ixml.ixml link is bogus, causing the browser to download what appears to the normal eye to be a text file. If the authors see this, I'd recommend changing that to point to the GH repo, https://github.com/invisibleXML/ixml/blob/master/ixml.ixml, which allows actually viewing it as well as making folks aware of the GH org.


xsugar is a similar solution (to a different problem): a mapping between a context free grammar and XML grammar, thus between text and XML. https://www.brics.dk/xsugar/

    XSugar makes it possible to manage dual syntax for XML languages. An XSugar specification is built around a context-free grammar that unifies the two syntaxes of a language. Given such a specification, the XSugar tool can translate from alternative syntax to XML and vice versa


So let's say I have a binary format... let's say PMX.

https://gist.github.com/felixjones/f8a06bd48f9da9a4539f

How do I implement a C++ library to parse PMX to XML and from that XML back into PMX?


I don't get the impression this is designed for binary formats, merely "non XML" ones. The task you described sounds like a better fit for https://kaitai.io/


How is it different from ANTLR? Like, one could quite trivially add an XML serializer to it and get a functionally equivalent program, right?


I read some samples; how is this better than good ol' JSON?



