Not the comp.text.sgml FAQ (2002) (flightlab.com)
94 points by Tomte on Nov 12, 2017 | 39 comments



  Q.  I'm designing my first DTD.  Should I use elements or
      attributes to store data?

  A.  Of course.  What else would you use?
I giggled.


This made me giggle:

Q. What's so great about ISO standardization?

A. It is often said that one of the advantages of SGML over some other, proprietary, generic markup scheme is that "nobody owns the standard". While this is not strictly true, the ISO's pricing policy certainly has helped to keep the number of people who do own a copy of the Standard at an absolute minimum.

    [ Ed. note: I'm not exactly sure why this is seen as an advantage,
      it's just something people say. ]


I don't know about SGML, but in XML attributes are subject to (mandatory) Attribute-Value Normalization, whereas whitespace normalization for element content can be disabled if you have control over the parser.

Both elements and attributes are unsuitable for storing arbitrary data in a straightforward manner. If you absolutely must, you should know about normalization, XML whitespace handling and CDATA.
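
For instance, a minimal sketch with the Python stdlib (element and attribute names invented): a literal newline inside an attribute value gets normalized to a space, while the same newline in element content is preserved.

  import xml.etree.ElementTree as ET

  # Same text in an attribute and in element content; the attribute value goes
  # through mandatory attribute-value normalization, the element content doesn't.
  doc = '<item note="line one\nline two">line one\nline two</item>'
  root = ET.fromstring(doc)

  print(repr(root.get("note")))  # 'line one line two'   (newline -> space)
  print(repr(root.text))         # 'line one\nline two'  (preserved)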


And you are why the parent poster giggled.


Enlighten me. We both said XML is not for data storage, no?


If you find yourself torturing your head about whether to use elements or attributes, then you're attempting to use a markup language as a general-purpose data representation format, something it wasn't designed for.

SGML and XML are for text, optionally marked up with tags/elements. Attributes are for data about element presentation, and not intended to be displayed directly. It's as simple as that.


99% of XML applications are not for text. Take SVG as an example.


True, but SVG's saving grace is that it's designed to be embedded in HTML (or in XHTML back then). It doesn't make a lot of sense otherwise. E.g., take SVG2's switch to representing drawing order with a z-index-like property, as opposed to SVG's original "painter's model", where drawing order is determined by document order. Basically, SVG doesn't make any use of distinguishing markup features such as element content (text elements being the only exception) and document order.

Usage of XML for business data and non-text document formats, OTOH, is an accident IMHO. But it's still the most robust format we've come up with so far for data exchange and archival in long-term commitment scenarios, and I don't see that changing anytime soon, because there aren't that many open standards being developed anymore.


SVG was a mistake, but we're stuck with it.

Does SVG offer any advantages over, say, EMF/WMF? I know a strong advantage of WMF is that the file format translates 1:1 into GDI calls, which makes it very fast - but I don't see EMF files rendered with anti-aliasing or complex gradients. What about PDF or PostScript?


If you're talking standards, then platform-specific advantages mean precisely nothing.


SVG would actually be great if it weren't for all the feature creep. Filters, masks, scripts and animations are supported only by some renderers, often inconsistently. This does not seem to stop SVG generators from producing documents that use these features, often unnecessarily.

Good SVG can even be human-readable and -editable. I've actually fixed simple broken SVGs with vim and a pocket calculator.

PostScript is great, but as a binary format it's not very modern in a world where text-based formats seem preferred. I had to look up WMF, and dumping calls to a Microsoft API as an 'open' standard does not sound too exciting either.


Postscript is a text format :-)

Its disadvantage is that it is a full-on programming language, not declarative, so it isn't friendly to editing tools.


You are right on both counts! I seem to have mixed it with PDF.

Sadly, programming is coming to SVG as well, with more and more renderers supporting JavaScript. That might be great for some use cases, but it disperses even further the field of possible generator/feature/renderer combinations that might (will) fail.


Oh dear god no. Trying to parse EMF+ is a bit of a nightmare, and it's not just 1:1 GDI calls: there are dual EMF/EMF+ files, it's stack-based (I suppose that's not so bad), with kludge after kludge (especially around text)... We use it because we have to.


EMF+ has GDI+ primitives in it, which do allow for anti-aliasing and gradients.


>I know a strong advantage of WMF is that the file format translates 1:1 into GDI calls

S... strong advantage?


Simplicity of implementation. GDI's drawing primitives exist in practically all other drawing APIs, so translating them to Cairo, OpenGL, or Direct2D should be trivial.


Hardly. You are coupling the Windows GDI to an image format. It’s horrible to parse, and reliant on Windows quirks. It’s really not that great.


Then 99% of XML applications are using it improperly. I can live with that estimate.


I'm confused. SGML/XML is capable of representing arbitrary tree structures (with metadata attached to the nodes), is it not?

Somewhat of a tangent, but I like the name of the algebraic structure for binary trees: magma [0].

[0] https://en.m.wikipedia.org/wiki/Magma_(algebra)


> I'm confused. SGML/XML is capable of representing arbitrary tree structures (with metadata attached to the nodes), is it not?

So is ASN.1, but nobody in their right mind would use ASN.1 as a markup language. The sole fact that it is possible to cram the structure into the format doesn't mean it's a good idea.


I cannot vouch for right-mindedness, but the ASN.1 Markup Language website [1] mentions "Encoding XML-defined messages using ASN.1".

[1] http://xml.coverpages.org/asn1-aml.html


Wait. A tree is just nested tags, right? I feel like I'm missing something.

Here I'm thinking of the correspondence between trees and strings of nested parentheses. So, strings like

(a (b c) d) (e f)

carry a tree structure, and any (non-associative) multiplication is just a fold operation over a tree. In this case it's like we tag each set of parentheses with a name/operation, so naively XML would seem to naturally represent this kind of thing.

I have little actual experience using XML directly though, so I'm genuinely curious as to what's so terrible or "crammy" about my ideas here.
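
To make that correspondence concrete, here's a rough sketch in Python (the helper names are invented): the nested-parentheses string, a nested XML document and a post-order fold all describe the same tree.

  import xml.etree.ElementTree as ET

  # (a (b c) d) written as nested XML
  tree = ET.fromstring("<a><b><c/></b><d/></a>")

  def fold(elem, f):
      # Post-order fold: combine an element's tag with its folded children.
      return f(elem.tag, [fold(child, f) for child in elem])

  def as_sexp(tag, kids):
      # Rebuild the parenthesized form from the XML tree.
      return tag if not kids else "(" + " ".join([tag] + kids) + ")"

  print(fold(tree, as_sexp))  # (a (b c) d)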


There's a lot of weird ideas floating around in the comments on this link. I don't think you are on the wrong track.

There is some mismatch between what XML was designed for and the problems XML is good at solving. It was most certainly designed to be easy for humans to read and write manually. In practice, it is a great interchange format, by which I mean the specific idea of different parties writing XML for exchange between each other, because it can be validated mechanically. There is a large and powerful ecosystem of software that has sprung up around it which simply isn't there for s-exps or JSON.

Elsewhere on here, marcoperaza points out that it is kind of against the hacker ethic. That's true. But enterprise software often involves multiple separate organizations having to agree on what a document can contain, and XML is great for that, and that use case tends to be more valuable in industry than whether it is the tersest, most flexible or most readable format.
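
As a sketch of what "validated mechanically" buys you (this assumes the third-party lxml package; the schema and documents are invented for illustration):

  from lxml import etree

  schema = etree.XMLSchema(etree.fromstring(b"""
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="order">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="sku" type="xs:string"/>
          <xs:element name="qty" type="xs:positiveInteger"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
  """))

  good = etree.fromstring(b"<order><sku>A-1</sku><qty>3</qty></order>")
  bad = etree.fromstring(b"<order><qty>three</qty></order>")

  print(schema.validate(good))  # True
  print(schema.validate(bad))   # False: either party can reject this up front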


This is a sweeping statement, and not true.

For example, I often use XML documents with no text:

  <contraindications> <pregnancy/> </contraindications>

And really, the rule of thumb about attributes vs elements is more about whether you are going to need children of an item; it's easier to extend elements with child nodes.
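
A made-up sketch of that rule of thumb, using the Python stdlib: a plain flag can live in an attribute, but once it needs structure of its own, a child element is the easier thing to extend.

  import xml.etree.ElementTree as ET

  # Attribute version: fine while "pregnancy" is just a yes/no flag.
  flat = ET.fromstring('<contraindications pregnancy="true"/>')

  # Element version: the same fact, but it can now carry attributes and
  # children of its own (the "trimester" qualifier here is invented).
  nested = ET.fromstring(
      '<contraindications><pregnancy trimester="3"/></contraindications>'
  )

  print(flat.get("pregnancy"))                      # 'true'
  print(nested.find("pregnancy").get("trimester"))  # '3'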


Might as well use YAML or TOML in that case. Same use case and you are not limited to ASCII strings for identifiers.


You aren't limited to ASCII for XML either.


Right.

> Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names.

Source: https://www.w3.org/TR/REC-xml/#sec-common-syn


>SGML and XML are for text

I'd say anything which can be nested.

For example, XML makes sense for a layout engine:

An arbitrary number of buttons can fit inside a layout? Make them children.

Text in a button? If there's only one piece of text and it can't contain child tags, make it an attribute. Otherwise, make it a child.
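
A toy illustration of that rule (element names invented), using the Python stdlib just to show the shape:

  import xml.etree.ElementTree as ET

  layout = ET.fromstring("""
  <panel>
    <button label="OK"/>
    <button>
      <bold>Cancel</bold>
    </button>
  </panel>
  """)

  # Repeatable things become children; a single plain-text label stays an attribute.
  print(len(layout.findall("button")))             # 2
  print(layout.findall("button")[0].get("label"))  # 'OK'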


In a similar vein there is "XML Sucks" [1] and "S-exp vs XML" [2]. Both have been discussed here on HN in the past.

The first one claims (without giving a source) that James Clark once said or wrote:

“Any damn fool could produce a better data format than XML” – James Clark 2007-04-06

[1] http://harmful.cat-v.org/software/xml/

[2] http://harmful.cat-v.org/software/xml/s-exp_vs_XML


I'm genuinely perplexed at that animosity towards XML. In that second link I'm unable to find any substantive problem other than that "XML endtags make it too verbose". That seems like a legitimate thing to worry about when considering serialization formats, but where is all the vitriol coming from?

Bizarre.


Beyond basic usage, XML is pretty complex: schemas, namespaces, XSLT, XSD, DTD, XPath and XQuery.

There are plenty of good arguments for the XML way of doing things. For example, having a rigorously defined way (XSLT) to specify transformations of schema-conforming XML is more robust than ad-hoc code that wrangles schemaless JSON.
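
For instance, a tiny sketch of that XSLT point (this assumes the third-party lxml package; the stylesheet and input are invented):

  from lxml import etree

  transform = etree.XSLT(etree.fromstring(b"""
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/people">
      <names><xsl:apply-templates select="person"/></names>
    </xsl:template>
    <xsl:template match="person">
      <name><xsl:value-of select="@name"/></name>
    </xsl:template>
  </xsl:stylesheet>
  """))

  doc = etree.fromstring(b'<people><person name="Ada"/><person name="Bob"/></people>')
  # The transformed document serializes to:
  # <names><name>Ada</name><name>Bob</name></names>
  print(etree.tostring(transform(doc)))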

But it does go against the hacker ethos and stands in the way of rapid development. And wherever it is used, complexity and verbosity seem to often follow. Look at SOAP, for example.


Yes. As a tree structure, basic xml is fine. There are lots of other ways to do it, but there's really not much difference. For communication, agreement on a common language is more important than its intrinsic merits. (and xml partly attained that because it looks like html, which was still new at the time).

But the xml ecosystem is horrible. Sensible ideas, horrific execution; like namespaces and schema. Probably the single worst problem was using xml syntax itself: it's like, a programming language that uses JSON for its syntax.

But also, there's guilt-by-association, people hate the enterprise culture that uses xml - similar happened to java.

Though xpath is not so bad, and many people seem to quite like it.

Finally... json is a better match for data, basically by being c-like. However, an ecosystem tumour is also growing around JSON. Some even use json syntax itself...

I wonder if, perhaps, a root issue is that the world is complex, and youthful simplicity is corrupted as it adapts to cope with the real world... There is hope, however; tools like `jq` never existed for xml.


JSON and XML aren't really equivalent though. XML is a markup language, JSON is a data serialization format. Of course a lot of hate for XML comes from people trying to use it to serialize data, which it can do but only in a clunky way. JSON is just plain better suited for what most people need.


  s/serialization format/object notation/


Thank you (and @marcoperaza) for unpacking things a bit. I was mostly unaware of the sideband technologies that go along with the basic tag and attribute structure, and your comments about the community and "guilt by association" got me thinking about library support and other practicalities which I hadn't been considering. Thanks :)

This also got me reading up on various structured-data formats: XML, YAML, JSON, TOML, HCL, etc. I'd really like some big table comparing various features but can't seem to find anything of the sort.

I found a link [0] that has comparisons between JSON, TOML and YAML representations for various types of data. It's neat to see how each becomes more or less verbose depending on the kind of data getting encoded.

[0]: https://gohugohq.com/howto/toml-json-yaml-comparison/

edit: I found a table:

https://en.wikipedia.org/wiki/Comparison_of_data_serializati...


JSON should only really be used in a machine-to-machine context, i.e. for serialization. If you want something that's easily editable with a text editor, use YAML. XML is horrible for hand-editing, but it's easier to read than either YAML or JSON, it's super flexible, it has heaps of support in terms of tooling and libraries, and it's well understood, so it's great for information interchange. Verbosity isn't a huge issue for machines, and it's actually helpful when doing data integration. Concerns about wasting space are spurious, since any redundancy in the data can easily be removed with some basic compression. And as somebody else mentioned, XPath is actually pretty good and makes the verbosity a net positive for certain applications.
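
A quick illustration of the XPath point (this assumes the third-party lxml package; the document is invented):

  from lxml import etree

  doc = etree.fromstring(b"""
  <library>
    <book lang="en"><title>SGML Handbook</title></book>
    <book lang="de"><title>XML in a Nutshell</title></book>
  </library>
  """)

  # One declarative expression instead of hand-written tree walking:
  print(doc.xpath("//book[@lang='en']/title/text()"))  # ['SGML Handbook']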


I think `jq` is more powerful, but for those afflicted with xml instead of json, there is `xmllint`. In a pinch, one can use it to hack something together (along with xpath and xslt).


> but where is all the vitriol coming from?

XML is fine if all you want is a human-readable format for defining tree data structures, such as documents, to be used in applications where only strings are involved and someone within the use case needs the semantics of each node and each attribute spelled out clearly and unequivocally in the document structure itself.

For any other case, XML is horrible.

Now consider that XML is used quite extensively in cases well beyond the tree-based DOM data structures it was designed for.



