Nobody expects CDATA sections in XML (lcamtuf.blogspot.com)
219 points by dmit on Nov 30, 2014 | 116 comments



The irony is that CDATA isn't even very useful; there's no way to escape the ]]> closing tag so you still have to invent some special escaping mechanism to use it.

Nobody expects entity definitions in XML either, and yet about once a year some new service or software is found vulnerable to XXE attacks. (Summary: a lot of XML parsers can be made to open arbitrary files or network sockets and sometimes return the content.)

XML is a ridiculously complex document format designed for editing text documents. It is not a suitable data interchange format. Fortunately we have JSON now.


You are allowed to chain CDATA-tags: ]]]]><![CDATA[>

> CDATA sections may occur anywhere character data may occur;

(http://www.w3.org/TR/REC-xml/#sec-cdata-sect)
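
For a concrete illustration, a minimal sketch (using Python's stdlib ElementTree, which concatenates adjacent character data into a single text node):

    import xml.etree.ElementTree as ET

    # "]]>" can't appear literally inside a CDATA section, but it can be
    # split across two sections: end the first after "]]", open a new one
    # before the ">".
    doc = '<root><![CDATA[a]]]]><![CDATA[>b]]></root>'
    print(ET.fromstring(doc).text)  # prints: a]]>b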


Is it just me who sees this as a very bad idea?


It's really handy for some use cases. For example, if you are writing documentation involving XML, you might put your example XML in CDATA sections. The way CDATA works is also far simpler than in XML's predecessor, SGML. I'd say it's a good feature.

It sucks that there are some shoddy libraries out there, but oh well.


I mean that chaining CDATA tags as a way to escape CDATA tags is a bad idea.

It'll break anything that doesn't expect it - in particular I can see anything that does round-trips from/to XML breaking.


Everything already must expect it. Nothing in XML prevents "some text content <![CDATA[blah blah]]> other text content" from appearing in a text node. There is no obligation that CDATA must be the only thing in a text node. IIRC, many parsers will already return multiple nodes for text if you put an entity in the middle ("text &amp; text" coming back as three nodes), so you're already really and truly broken (i.e., not just "theoretically" broken) if you can't handle consecutive text-like nodes and merge them in some manner. This is especially true since you can entity-encode anything at all, so you already must be able to handle "hello w&#111;rld" properly anyhow or you've got a straight-up bug.
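
A minimal sketch of that behavior (using Python's stdlib SAX bindings; the exact chunking is parser-dependent): the characters() callback can fire several times for what looks like one run of text, and it's the consumer's job to concatenate.

    import xml.sax

    class TextCollector(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.chunks = []

        def characters(self, data):
            # May be invoked several times around entities and CDATA sections.
            self.chunks.append(data)

    handler = TextCollector()
    xml.sax.parseString(b'<t>text &amp; <![CDATA[more]]> text</t>', handler)
    print(handler.chunks)           # typically several pieces, not one string
    print(''.join(handler.chunks))  # text & more text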


> XML is a ridiculously complex document format designed for editing text documents. It is not a suitable data interchange format. Fortunately we have JSON now.

XML is about as simple as it gets for structured text documents. HTML is more complicated. Plain text is not expressive enough. Markdown, Asciidoc, reStructuredText, Wiki Creole, etc. all have pretty severe shortcomings by comparison to XML, and text processing systems will sometimes just convert those formats to XML. XML is easy to parse, easy to edit, and easy to emit.

XML also gives us SVG, which is lovely.

Yeah, use JSON everywhere else. But XML is not ridiculously complicated. The 1.0 specification http://www.w3.org/TR/REC-xml/ is not very long.


> XML also gives us SVG, which is lovely.

That implies that SVG needed XML. SVG just needed a structured data format. It could just as easily have used JSON or Protocol Buffers and it would still be SVG.

> XML is easy to parse

More like it seems easy to parse. Plenty of people think they are parsing XML but their ad hoc "parsers" know nothing about CDATA, DTDs, external entities, processing instructions, or comments.

> easy to edit

...except for gotchas like the fact that you need to entity-escape any ampersands in attributes (like "href").

> and easy to emit

Harder than it sounds. A single error renders the whole document invalid, and when you're combining information from different data sources, it's easy to make mistakes: https://web.archive.org/web/20080701064734/http://diveintoma...

> The 1.0 specification http://www.w3.org/TR/REC-xml/ is not very long.

Sure, but combined with the other specs which are assumed to be part of a modern XML stack (namespaces at least, and often XML schema, XSLT, etc), you have grown a pretty complicated mess that isn't a great match for what it's often used for.


> It could just as easily have used JSON or Protocol Buffers and it would still be SVG.

Definitely not protocol buffers. XML gives us some really nice bits of SVG, like the ability to put attributes and tags in namespaces, so you can use Inkscape to edit your SVG file, store a bunch of Inkscape-specific data in the SVG file, and not have other editors puke. SVG isn't just data interchange, it's a document edited by humans. JSON doesn't accommodate that very well.

> More like it seems easy to parse. Plenty of people think they are parsing XML but their ad hoc "parsers" know nothing about CDATA, DTDs, external entities, processing instructions, or comments.

I wouldn't use an ad-hoc parser for JSON either. The problem here isn't XML, the problem is thinking that you can solve your problem with regular expressions. Ignoring or throwing errors for DTDs, external entities, and PIs is reasonable behavior most of the time, and most parsers can be set to strip comments and erase the differences between entities/CDATA/text. This behavior is good enough for 99% of the use cases.

> ...except for gotchas like the fact that you need to entity-escape any ampersands in attributes (like "href").

Escape sequences in XML are better than JSON's, at least. Try escaping a character from the astral plane in JSON: you have to encode it in UTF-16 and then encode each half of the surrogate pair as a separate escape sequence. This is insane. (Yes, sometimes you want to transmit JSON or XML in 7-bit.)
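
A quick sketch of the difference (Python here, purely as an illustration; U+10E6D is an arbitrary astral-plane code point):

    import json

    ch = '\U00010E6D'   # a code point outside the Basic Multilingual Plane

    # JSON's \u escapes only cover 16 bits, so the escaped form needs a
    # UTF-16 surrogate pair:
    print(json.dumps(ch))              # "\ud803\ude6d"

    # An XML character reference takes the code point directly:
    print('&#x{:X};'.format(ord(ch)))  # &#x10E6D;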

Escape sequences are a natural part of any text format, other than plain text.

> Sure, but combined with the other specs which are assumed to be part of a modern XML stack

Most of the modern XML stack is a mistake, a symptom of the years when people thought XML was the coolest thing ever. XSLT is the worst mistake of all. That doesn't mean that you have to use it. Most people don't.

Let's not fall into the trap of thinking that we should use JSON for everything, just like so many fell into the trap of thinking that they should use XML for everything. Both have their use cases.


> Definitely not protocol buffers.

You know Protocol Buffers has a text format, right?

> I wouldn't use an ad-hoc parser for JSON either. The problem here isn't XML, the problem is thinking that you can solve your problem with regular expressions.

You are drawing a false equivalence. No matter how you slice it, JSON is far, far simpler to (correctly) parse than XML.

> Escape sequences in XML are better than JSON, at least.

The best case you have for this is that you need to encode high-Unicode characters over a non-8-bit-clean channel? That's a very fringe use case. And your argument is that "&#x10E6D;" is way better than "\uD803\uDE6D"? Neither of those looks particularly user-friendly to me.

> Let's not fall into the trap of thinking that we should use JSON for everything, just like so many fell into the trap of thinking that they should use XML for everything. Both have their use cases.

The point is not that JSON is best for everything, the point is that XML is best for almost nothing.


> Definitely not protocol buffers.

I fail to see why protocol buffers couldn't have worked for SVG. They are expressly defined so that you can have extensible types without breaking parsers...

> I wouldn't use an ad-hoc parser for JSON either. The problem here isn't XML

...but you can find plenty of "parsers" for both that don't actually parse either fully, correctly, or securely.

> Try escaping a character from the astral plane in JSON, you have to encode it in UTF-16 and then encode each item in the surrogate pair as a separate escape sequence.

One of many reasons to be annoyed by those who think JSON is a perfectly good data format.

> Escape sequences are a natural part of any text format, other than plain text.

Another reason to loathe them. ;-)

That said, you can avoid escape sequences in text formats, so long as your parse rules are length delimited, rather than based on reserved characters.
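
For instance, a toy netstring-style sketch (Python; not a full netstring implementation): the payload is never scanned for reserved characters, so nothing ever needs escaping.

    def encode(payload: bytes) -> bytes:
        # length prefix, ':', payload, ',' -- the payload bytes are opaque
        return str(len(payload)).encode() + b':' + payload + b','

    def decode(frame: bytes) -> bytes:
        length, _, rest = frame.partition(b':')
        n = int(length.decode())
        assert rest[n:n + 1] == b','
        return rest[:n]

    print(decode(encode(b'no ]]> or "quotes" to escape here')))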

> Most of the modern XML stack is a mistake, a symptom of the years when people thought XML was the coolest thing ever. XSLT is the worst mistake of all. That doesn't mean that you have to use it. Most people don't.

Agreed, but generally if you aren't supporting it, you aren't using XML.

> Let's not fall into the trap of thinking that we should use JSON for everything, just like so many fell into the trap of thinking that they should use XML for everything. Both have their use cases.

Yes, though I'd argue they are primarily for causing problems.


Actually, JSON is such a simple format that anybody with a bachelor's in computer science should be able to write a parser as correct as the standard implementations (which often ignore the requirement that the root element be an object, and accept multiple entries with the same identifier).


You're actually wrong.

(1) "A JSON text ... conforms to the JSON value grammar."

(2) The entire standard for the object grammar is, "An object structure is represented as a pair of curly bracket tokens surrounding zero or more name/value pairs. A name is a string. A single colon token follows each name, separating the name from the value. A single comma token separates a value from a following name."

In other words, people SHOULD ignore both of those, because ignoring both of those is a part of the ECMA-404 JSON standard. (The latter actually wasn't even a part of RFC 4627; the operative word there was SHOULD, not MUST.)


It's simple, but it's limited at the same time. And, as when one tries to use an inappropriate data structure, this may lead to issues.


Yes, not allowing comments, for example. How could one leave out such a basic feature?


It all boils down to JSON's ultimate lack of extensibility. This is a significant downside, but, on the other hand, it's also a strong feature of JSON.


> But XML is not ridiculously complicated. The 1.0 specification http://www.w3.org/TR/REC-xml/ is not very long.

Also note that the XML 1.0 specification contains both the specification of the data format and of a schema language for the format itself (DTD). Without the definition of the DTD, XML could be much, much simpler.

I second Norman Walsh's proposal for XML 2.0: Just drop the <!DOCTYPE> declaration [1]. XML is pretty good as it is for its intended scope (marking up text), dropping the DOCTYPE/DTD would remove the main source of complexity and insecurity.

[1] http://www.tbray.org/ongoing/When/200x/2005/12/15/Drop-the-D...


One of the main problems isn't just encoding XML correctly, but the additive mistakes that arise when information is copied and reused, where along the way some part of the chain makes a mistake:

- Information scraped from a web page that was in ISO-8859-1

- Stored in a database that is Windows-1252

- Then emitted through an API in UTF-8 by someone who builds the XML by string concatenation ("<tag>" + string + "</tag>")

- Then stored in a new database as UTF-8 but not sanity-checked (i.e., MySQL instead of Postgres)

- Then emitted as an XML feed

...etc. Along the way someone forgets to encode the "&", the data contains random spatterings of ISO-8859-1 characters, and you're screwed.

Most parsers I have encountered aren't lenient by default, and will barf on non-conformant input. So now the last link in the chain needs to sanitize and normalize, which is a pain.
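
The string-concatenation step in particular is avoidable; a minimal sketch (Python, with made-up sample data) of escaping versus letting an XML library do the serialization:

    from xml.sax.saxutils import escape
    import xml.etree.ElementTree as ET

    raw = 'Fish & Chips <deluxe>'      # whatever came out of the scrape/DB chain

    bad = '<tag>' + raw + '</tag>'     # not well-formed: bare "&" and "<"
    ok = '<tag>' + escape(raw) + '</tag>'

    elem = ET.Element('tag')           # or skip concatenation entirely
    elem.text = raw
    print(ok)
    print(ET.tostring(elem, encoding='unicode'))
    # both print: <tag>Fish &amp; Chips &lt;deluxe&gt;</tag>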


XML and JSON are different data formats with different properties and different uses.

Please don't blame one and praise the other; just use what's appropriate. It's a bit like data structures: one generally won't use a graph when one actually needs a set, right?

Trying to stash data into a semantically inappropriate format leads to kludges, and I'd say some JSON-based formats (for example, HAL-JSON) feel like that. Obviously, it's the same (or possibly even worse) for XMLish abominations like SOAP. Neither XML nor JSON is a silver bullet.

My point is that while, sure, XML has its issues and arcane features, it's not universally terrible. I wonder if there's some standardized "XML Basic Profile" that's as minimal as possible yet still a functional and expressive subset of XML for the most typical use cases. In much the same manner, XML itself was "extracted" from SGML.


The problem discussed here is that XML is somewhat complicated to parse correctly. If you only support a subset of XML, then it is easy to parse. The problem is that I might not know what subset you are using: when I send you advanced XML (since you said you support XML), something might crash. There is probably a need for a simpler subset of XML with its own name (like XML 2.0 or XML-WS or Mini-XML). Personally, I would like to skip the requirement for a root element. Since JSON is a bit simpler, it has fewer of these problems. Of course it has other problems, but that is another matter.


Why the nasty words for SOAP/XML? I mean, it can be kind of unpleasant to work with, but it holds some major advantages over popular alternatives for building and consuming non-trivial web APIs. WSDL as a mechanism to describe serialization style, operations, endpoints, data types and enums, security bindings, etc., plus all the tooling it makes possible are quite powerful.


Wow, are some poor bastards still using WSDL somewhere? I thought it was long dead. Back in 2006 I wrote up a quick bit on why SOAP sucks, my experience of running a couple of Google public SOAP services. Basically none of the stuff you describe actually works. http://www.somebits.com/weblog/tech/bad/whySoapSucks.html


SOAP/XML, WSDL, and related tooling are not always a walk in the park, but they certainly haven't faded into history. Indeed there are plenty of pathological cases, and overblown, tool-bound APIs designed by architecture astronauts, but it is also possible to publish and consume them in a manner just as simple and straightforward as the examples suggest. Taking just one example from my own experience, Salesforce's SOAP/XML services have proven easy to use, sensibly managed and versioned, and far quicker than hand-building structures to manage "simpler" REST.


> Wonder if there's some standardized "XML Basic Profile" that's a as minimal as possible

If you ignore Schema (and all the WS-* stuff that requires them), the rest is as simple as it comes, even if you include DTDs. Which is why XML is used in a number of roles and people at one point thought it was the best invention since sliced bread.


> XML is [...] designed for editing text documents.

Was this ever a stated design goal? SGML, sure, probably; but I've never seen any evidence that XML had "document markup" as its sole intended application.


I'm sure it's up for interpretation, but it's a reasonably defensible position. Here's what the XML 1.0 Spec has to say:

Abstract

  The Extensible Markup Language (XML) is a subset of SGML 
  that is completely described in this document. Its goal is 
  to enable generic SGML to be served, received, and 
  processed on the Web in the way that is now possible with 
  HTML. XML has been designed for ease of implementation and 
  for interoperability with both SGML and HTML.
  ...
1. Introduction

  Extensible Markup Language, abbreviated XML, describes a 
  class of data objects called XML documents and partially 
  describes the behavior of computer programs which process 
  them. XML is an application profile or restricted form of 
  SGML, the Standard Generalized Markup Language [ISO 8879].   
  By construction, XML documents are conforming SGML documents.

  XML documents are made up of storage units called entities, 
  which contain either parsed or unparsed data. Parsed data 
  is made up of characters, some of which form character 
  data, and some of which form markup. Markup encodes a 
  description of the document's storage layout and logical 
  structure. XML provides a mechanism to impose constraints 
  on the storage layout and logical structure.
  ...
1.1 Origin and Goals

  ... 
  The design goals for XML are:
  XML shall be straightforwardly usable over the Internet.
  XML shall support a wide variety of applications.
  XML shall be compatible with SGML.
  It shall be easy to write programs which process XML documents.
  The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  XML documents should be human-legible and reasonably clear.
  The XML design should be prepared quickly.
  The design of XML shall be formal and concise.
  XML documents shall be easy to create.
  Terseness in XML markup is of minimal importance.
http://www.w3.org/TR/1998/REC-xml-19980210


It seems to me that the real problem with XML is CDATA sections which have been used to cover a multitude of sins. CDATA should have been implemented as (say) normal XML with a reference (offset and length) to binary data at the end of the XML document. No special escape sequences would have been necessary, and the XML body could be kept clean.


The single and only interpretation that goes from that to "designed for editing text documents" is the fact that the authors called them "XML Documents". But XML Documents != "text documents". XML Documents are big property bags of data, and have no correlation with "editing text documents". The primary use of XML was, and remains, data exchange, where system A generates data that is consumed by system B.


> Fortunately we have JSON now.

Of course, we had s-expressions long ago.

But I agree about XML.


> Fortunately we have JSON now.

Even JSON is really a sin. Using simple binary data formats like protocol buffers makes so much more sense.


The fact that some people don't properly configure their parsers isn't an argument against the format.


It kind of is, though, since the format specifies a very non-obvious feature which may have serious security consequences if left enabled, which isn't easily discoverable for users starting out with XML, and which, frankly, has little use. A better format would bring fewer surprises.


Yeah, many users would actually not like some of these features. If you're using XML to serialize static tree-structured data in a fixed schema (like people commonly use JSON), would you expect that your parser would be vulnerable to this?

https://en.wikipedia.org/wiki/Billion_laughs
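
For instance, a hedged sketch of that kind of payload and one common mitigation (this assumes the third-party defusedxml package, which rejects entity definitions by default; stdlib parsers historically expanded internal entities without limit):

    import defusedxml.ElementTree as SafeET

    billion_laughs = """<?xml version="1.0"?>
    <!DOCTYPE lolz [
      <!ENTITY lol "lol">
      <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
      <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
    ]>
    <lolz>&lol3;</lolz>"""

    try:
        SafeET.fromstring(billion_laughs)
    except Exception as exc:
        # defusedxml refuses the entity declarations instead of expanding them
        print('rejected:', type(exc).__name__)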


"Some people" is a pretty big category with some good company. Google in April 2014, for instance. Or PostgreSQL in 2012. http://blog.detectify.com/post/82370846588/how-we-got-read-a... http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-3489


Facebook as well: http://www.ubercomp.com/posts/2014-01-16_facebook_remote_cod... XXE to Remote Code Execution. They paid out their largest bounty ever for it, $33,500.


When choosing a format to use for a real-world problem, the real-world parsing libraries and their real-world behaviour (including how they tend to be configured in practice) are important considerations.


If making your parser spec-compliant makes it more vulnerable to security problems that would tend to suggest something other than the parser is problematic.

See also: https://www.youtube.com/watch?v=PE9fXM7aOxo


If people are using an XML parser they're still ahead of the majority of the software business.

We have one partner that discovered a bug in our XML feed (developed by a consultant); the bug only surfaces because they actually use an XML parser. They're the first of our customers to find this bug in 10 years. The rest simply view the feed as plain text.

The consulting company that did the original feed generation also has something against using XML parsers. The only code they have that doesn't just concatenate strings is a logging library (a library we asked them to stop using because their way of doing XML logs is useless with Splunk and pretty much any other tool).


I once made a consumer for an xml api. I made a simple regex hack that worked fine. After reading some rants about "parsing" xml with regex I swapped it out for a real parser.

A few minutes later the parsing started failing because there were unescaped <> characters in attributes. Reported the bug, got a wontfix back.

I reverted to regex and it has been working fine ever since.


If there are unescaped <> chars in the "xml" then it's NOT xml and the API shouldn't be called "xml" but "plain text made to look a little like xml".

The people producing such garbage should be ashamed of themselves and should be publicly shamed.

Of course, as a consumer of the API we often don't have any power over the producer and have to swallow what we're given as is; but even in that case the correct approach is to have a first step of cleaning/correcting the XML (with something like Beautiful Soup, for example) and then feeding the clean XML to a proper parser.


What!? Of course it is. It might not be an argument that you buy, but it's definitely an argument.


Invalid argument then.


"Fortunately we have JSON now." Which doesn't support big ints. JSON isn't a silver bullet.


Sure it does:

  { "mybigint" : -123434580239458203948203982345723458 }
You can throw the spec out the window and put as many digits as you want into a JSON integer. Any half decent parser in a half decent language will accumulate the token and spit out an integer object.

There is no reason to write a JSON parser that doesn't accept bignum integers, in a language that has them.

There are only two reasons some JSON implementation doesn't accept them. One is that the underlying software doesn't handle them. In that case JSON is moot; it's a limitation of that system. You cannot input bignums into that system through any means; they are not representable. The other is that the JSON implementation was written by obtuse blockheads: bignums could easily be supported since the underlying language has them, but aren't simply for compliance with the JSON specification. Compliance can be taken to counterproductive extremes.


For example:

    import 'dart:convert';
    
    main() {
      print(JSON.decode('{ "mybigint" : -123434580239458203948203982345723458 }')['mybigint']);
    }
Prints -123434580239458203948203982345723458


What does it do for large floats? In Python:

    >>> import json
    >>> json.loads('{ "mybigfloat" : -1' + "8"*310 + '.0}')
    {u'mybigfloat': -inf}


> There is no reason to write a JSON parser that doesn't accept bignum integers, in a language that has them.

Couldn't you write that about almost any data type in JSON? The entire point of having a standard for such things is so that you don't have to worry about the differences between individual parsers/serializers.


> Couldn't you write that about almost any data type in JSON?

Yes, you could.

> The entire point of having a standard for such things is so that you don't have to worry about the differences between individual parsers/serializers.

Things like the maximum size of strings or integers are implementation limits. These should be kept separate from the language definition per se.

There are de facto different levels of portability of JSON data. A highly portable piece of JSON data confines itself to not exceeding certain limits, like ranges of integers. We could call that "strictly conforming".

JSON data which exceeds some limits is not strictly conforming, but it is still well-formed and useful.

Limits are different from other discrepancies among implementations because it is very clear what the behavior should be if the extended limit is supported. If an implementation handles integers beyond the JSON specified range, there is an overwhelming expectation that those representations keep denoting integers.

This is different from situations where you hit some unspecified or undefined behavior, where an implementation could conceivably do anything, including numerous choices that meet the definition of a useful extension.


> Things like the maximum size of strings or integers are implementation limits. These should be kept separate from the language definition per se.

At least with integers, I think protobuf's ability to specify 32-bit and 64-bit integers has been quite helpful.

> There are de facto different levels of portability of JSON data. A highly portable piece of JSON data confines itself to not exceeding certain limits, like ranges of integers. We could call that "strictly conforming".

Yeah... and that's how you get yourself into problems. Life is simpler with one standard that either works or doesn't. Easily the most annoying thing with protocol buffers is using unsigned integers, because of Java's signed integer foolishness. Yes, you can argue that's a reason to have the "strictly conforming" concept, but I'd argue quite the opposite.

> Limits are different from other discrepancies among implementations because it is very clear what the behavior should be if the extended limit is supported. If an implementation handles integers beyond the JSON specified range, there is an overwhelming expectation that those representations keep denoting integers.

Hmm... I don't think that is clear at all. In fact, it isn't clear to me when I have to worry about floating point rounding potentially kicking in.

> This is different from situations where you hit some unspecified or undefined behavior, where an implementation could conceivably do anything, including numerous choices that meet the definition of a useful extension.

In practice, there seems to be little difference. While there might be some idealized behaviour that is expected, there appears to be plenty of wiggle room for a variety of behaviours for these "not strictly conforming" cases.


> The entire point of having a standard for such things is so that you don't have to worry about the differences between individual parsers/serializers.

That makes this an interesting case. As pointed out elsewhere, adamtulinius is randomly passing on an easily-falsified urban legend; the JSON standard explicitly allows bignums. That doesn't leave much room for not having to worry about the differences between individual parsers; if the language doesn't allow bignums, it's not going to parse them no matter how much it wants to be standards-compliant.


I think, ironically, there are a lot of JavaScript based parsers that won't handle bignums all too well. ;-)


Where in the spec does it say that? http://json.org/ I see number, which can be an int, which can be digit1-9 digits, and digits is digit digits. No size limitation there.

In fact numbers have no semantic meaning applied to them in JSON. Probably the only reason to include them is to standardize their representation as a debugging aid.


The common approach I have seen/used is to encode big ints as strings instead. The lack of support for NaN and +/-Infinity also means that not every double will encode in JSON.

Another trick, since comments aren't supported, is to add keys named "//SomeComment": ""; this is still syntactically correct JSON and you can simply ignore them inside your program.

JSON is missing a lot of little things, but once you add them in you end up with something that is a lot harder to parse, which ultimately hurts the ubiquitous appeal of JSON.
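
A tiny sketch of both workarounds (Python, with illustrative key names):

    import json

    payload = {
        "//note": "a comment smuggled in as a throwaway key",
        "big_id": str(2 ** 128),      # big int shipped as a string
        "value": 42,
    }

    decoded = json.loads(json.dumps(payload))
    big_id = int(decoded["big_id"])   # consumer converts back explicitly
    print(big_id)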


The JSON parser in the Python standard library does support big ints, NaN, and +/-Infinity.

  >>> json.loads("NaN")
  nan
  >>> json.loads("Infinity")
  inf
  >>> json.loads("-Infinity")
  -inf
  >>> json.loads("1234567890"*20)
  123456789012345678901234567890123456789012345678901234567890
  123456789012345678901234567890123456789012345678901234567890
  123456789012345678901234567890123456789012345678901234567890
  12345678901234567890L
I haven't read the JSON spec, so I'm wondering whether this is actually standard JSON, whether it's an official extension of some kind to JSON, or whether the Python JSON parser is just being too clever.


Not standard.

> Numeric values that cannot be represented as sequences of digits (such as Infinity and NaN) are not permitted.

- http://www.ecma-international.org/publications/files/ECMA-ST...


Huh, the Python json module documentation admits that the Python behavior isn't standards-compliant.

https://docs.python.org/2/library/json.html#standard-complia...

It does claim that the support for arbitrary precision numbers is standards-compliant. Looking at the ECMA standard briefly, that seems to be correct to me: the official JSON grammar does allow arbitrary-precision, and disclaims restrictions on how a particular programming language interprets numbers.

I have to admit that I'm kind of concerned about the NaN and Infinity issue because I thought there was more uniformity among JSON parsers about what is or isn't legitimate JSON.


So far the only serialization format I've found which isn't stuffed with surprising quirks is protobufs.


One could make an argument that its zigzag format is mildly quirky, but in general, I'd agree. Protobufs have some surprising limitations, but their simplicity has limited their quirkiness.


That's true. There are some annoying limitations, though usually they're easy to work around at the app layer (e.g. no flexible type system).


Fair enough. Then again XML doesn't support any kind of int and yet we managed. (Unless you count XML Schema, in which case now you have N problems.)


As a Peano player I'd represent the integer 3 in XML as

    <succ>
      <succ>
        <succ>
          <zero></zero>
        </succ>
      </succ>
    </succ>


Don't give them ideas...


Actually they split things up nicely so you can use XMLSchema part 2 (which contains the useful data types) without part 1 (which contains the N problems you mention).


I can't resist pointing out for a third time what colanderman and schoen have already said: JSON supports bigints exactly the same as small ints. Where did you get the idea that it didn't?


Ahh, grasshopper. Truth is, there is more than one "truth".


Meh. Given all the downvotes: that was a smart-ass way of saying "put the integer in quotes so it transfers".


> designed for editing text documents

This sort of argument has never held water, and it constantly perplexes me that it gets a pass in tech discussions. Even if we accept the dubious assertion that it was "designed" for editing text documents, the origins of some invention say literally nothing about its utility for a purpose.

Further, as complex as XML may be, comparing the robust, rich, diverse ecosystem of XML support with the amateur-hour, barely credible JSON world is quite a contrast. JSON doesn't even have a date type (and everyone seems to roll their own). It lacks any sort of robust validation or transformation system: XML schemas are really one of XML's greatest features, and the nascent, mostly broken facsimiles in the JSON world don't compare.

JSON versus XML is a lot like NoSQL versus RDBMS -- the former is easier to pitch because its complete absence of a wide set of functionality seems like it makes implementations easier, when really it just pushes the complexity down the road.


I like XML for structured data. I also like XSL for self-documenting format-to-format conversion (or display). However, I am going to respectfully disagree with you about XML schemas: they suck...

XML schemas don't "understand" XML the language, and as a result you cannot use schemas unless you're using XML wrongly. For example, in XML:

     <myroot>
       <thing>
         <name>Robot</name>
         <size>15</size>
       </thing>
       <person>
         <name>John Smith</name>
         <id>12345</id>
       </person>
     </myroot>
And:

     <myroot>
       <person>
         <name>John Smith</name>
         <id>12345</id>
       </person>
       <thing>
         <name>Robot</name>
         <size>15</size>
       </thing>
     </myroot>
These two XML documents produce identical DOMs and/or XPaths. The fact that person and thing are inverted in the second example is irrelevant, but now try to write a schema definition which supports either thing/person or person/thing ordering.

It is possible to do with just one node switch in the above example, but convoluted. As you add more and more nodes the definition becomes unmanageable, imagine 10+ nodes on the same "level" but they can be ordered randomly, and are all required (and your schema needs to check they exist).

XML schema sucks, really sucks, when you use XML correctly and ignore node order. So if you choose to use a schema you're then forced to also require everything in an exacting order which can be problematic when XML is automatically generated from dozens of different systems or people.

XML schema sucked so bad we wrote our own replacement in Java. It just generates an XML DOM tree (via the standard libraries), then it parses your "schema" file which is just a list of XPaths with either a required, optional, or excluded flag. It is like 20 lines of code and it is better than XML Schema language.
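
Not the actual Java tool, of course, but a minimal sketch of the same idea in Python (rule flags and XPaths invented for illustration):

    import xml.etree.ElementTree as ET

    RULES = [
        ('required', './/person/name'),
        ('required', './/thing/size'),
        ('excluded', './/thing/password'),
    ]

    def validate(xml_text, rules=RULES):
        root = ET.fromstring(xml_text)
        errors = []
        for flag, xpath in rules:
            hits = root.findall(xpath)
            if flag == 'required' and not hits:
                errors.append('missing: ' + xpath)
            elif flag == 'excluded' and hits:
                errors.append('forbidden: ' + xpath)
        return errors   # node order never matters here

    doc = ('<myroot><person><name>John Smith</name><id>12345</id></person>'
           '<thing><name>Robot</name><size>15</size></thing></myroot>')
    print(validate(doc))   # []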


No idea about "XML Schema", but it's not the only schema language and Relax NG "interleave" should do what you want.

http://relaxng.org/spec-20011203.html#IDA2Z0R


As far as I understand, this is a limitation of the underlying processing model (tree automata and such, although I'm not an expert). That is, I suspect it would be equally hard to describe such an unordered collection of required elements using a context-free grammar. (On the other hand, Relax NG seems to handle it nonetheless.)


As someone not intimately familiar with xml wouldn't xs:all do what you need? It provides unordered but required.


In theory yes. In practice absolutely no. The examples are oversimplified, and when you start using xs:all or xs:choice you quickly start running into inheritance issues (e.g. the top level node is optional, but when it exists child-XYZ are mandatory, and when you mix child optional and child mandatory things get "strange").

I spent a few days trying to get a schema working with our format. I had it working well enough with a specific node ordering, but when I tried to get it working with an order-agnostic set it would all break down.

That's when I threw out two full days of work and replaced it with a Java class I spent about an hour writing and testing. Working flawlessly ever since.


> XML schema sucks, really sucks, when you use XML correctly and ignore node order.

How is ignoring node order "correct" (and on the flip side, how is demanding an order using it "wrongly")? In many XML parser implementations, node order is reflected in the resulting DOM. Yes, a schema lets you demand that nodes come in a specific order, and this is a complete non-problem, generally -- the whole point of schemas is that they very tightly define the format, and source systems abide by it, even if that means they have to alter the generating order to be compliant.


With JSON you use a programming language to transform your data. You also use a programming language to validate your data. Validating and transforming data are some of the most basic capabilities of all general-purpose programming languages, and reinventing these facilities badly in a not-exactly-programming-language like XSLT or XML Schema is a waste of time.


This is the same as saying that one shouldn't use parser generators, or even describe the grammar of a language in Backus-Naur form. Any language is usually very well equipped with tools to read strings, and BNF is absolutely not a programming language. Is it a waste of time?

If you validate a couple of document formats, then maybe, but if you need to validate many different formats on a regular basis you'll soon want to split your program into 1) a language that describes the rules and 2) an interpreter that takes such a description and validates a document against it. DTD, XML Schema, and Relax NG are just such tools.


> XML schemas are really one of XML's greatest features

So great that you get to choose between the schema system that's horribly complex (XML Schema) and the schema system that no one uses (DTDs).

> JSON versus XML is a lot like NoSQL versus RDBMS -- the former is easier to pitch because its complete absence of a wide set of functionality seems like it makes implementations easier, when really it just pushes the complexity down the road.

XML is like CORBA: a complicated mess whose proponents seem genuinely unaware that it's possible to achieve the same goals in a much simpler way (like Protocol Buffers).


There is nothing complex about xml schema. It is so profoundly simple, the tools so well worn and proven, that anyone who pronounces difficulty with it is probably in the wrong field.

And the notion that I must be unaware of protocol buffers (which has utterly nothing to do with the JSON vs xml conversation, but I suppose is the bigger brother) is laughable nonsense. Yes, protocol buffers have some advantages (and some significant disadvantages), but that is an entirely different discussion, and the attempt to try to pull it in, simplifying it as if it is some sort of solid point (protip: it isn't even close to a trump card. It has advantages, but it certainly doesn't invalidate xml), is transparent nonsense.

Whatever your personal hangups with XML, your scattered and somewhat desperate attempts to denigrate anyone who doesn't denounce it are grossly out of place anywhere professionals converse.


Whoa, you have taken my message a lot more personally than I intended my tone to land. I mean, I was being a little bit snippy about technology, but it wasn't meant at all as a personal attack. Sorry if it came off that way.

I wasn't meaning to say that you were unaware of Protocol Buffers; my argument is just that Protocol Buffers achieve most of what XML is trying to achieve, and I think many people don't appreciate that most of the complexity of XML is not inherent to the problem domain, but rather a product of an over-engineered technology stack that is not a good match for the problems to which it is usually applied.

A person who is marking up a text document, like with DocBook, has (in my opinion) the best argument for using XML. For more data structure-oriented programming like SVG or web APIs, I believe Protocol Buffers are better in every way, and I think I can defend that argument in rigorous debate.

If you go to the XML Schema website and print up the two specs there (Part 1 and Part 2), the printed spec is 371 pages. I think it is difficult to make the argument that it is "profoundly simple." The JSON RFC, at 10 pages, would deserve that characterization.


XML Schema describes a language to define a CF-like grammar and a set of extensible types, while JSON is a serialization notation for a data model that has strings, integers, real numbers, booleans, arrays, and dictionaries. No wonder they're different.


I've been following posts about this tool for a few weeks and it is really remarkable how many interesting results are already popping out, particularly since static analyzers have been around for years and years.

I'm assuming afl-fuzz is particularly CPU-bound, and it would be interesting to see some numbers about how many CPU years are being dedicated to it at the moment - and if we would see even more interesting stuff if a larger compute cluster was made available.

It's also super scary how "effortlessly" these bugs appear to be uncovered, even in "well-aged" software like "strings".


It would be pretty cool to have a public cluster that anyone can submit jobs to that are prioritized based on amount of donated CPU cycles. Instead of "Seti at home" it would be "fuzz at home".


I don't think that would be that efficient of a use of computing resources. Each instance explores the same instruction space. It keeps track of where it's explored, and uses various techniques to explore different parts of space.

It's very likely that multiple instances, if run in parallel and with no data sharing, will explore a lot of the same space.

Also, making a public cluster would be a security challenge. It runs arbitrary C/C++ code, and can trigger code paths that the developers didn't even realize. How would your box stand up to multiple grabs of 4GB of memory?


Don't have instances explore the same programs starting from the same state. Have state reported back in intervals so other workers can be scheduled to resume the work if that one goes down.

Security would be an issue. A VM would probably be a hard requirement. That can bound the memory usage, hardware calls, etc.


Even starting from different states doesn't help, because afl uses semi-random search techniques. It's very likely that different start locations will still have a large overlap.

Nor is it obvious that state reporting is useful. I ran afl for about 4 days. It ran my test code about 1,000 times per second, for a total of nearly 1/2 billion test cases.

That's a lot of data exchange for each program to report and resynchronize.

I'm not saying it's impossible. I'm suggesting that it's likely not worthwhile. It would be better to support multiprocessing first, before looking towards distributed computing.


As I understand it, it works by exploring different codepaths. So you just assign each computer to work on one seed file or codepath, each as different as possible, covering the largest area possible. After a while, decide which paths are the most interesting to explore and reassign.


afl-fuzz can be parallelized fairly easily. The exchanged data amounts to newly-discovered, interesting inputs that then seed the subsequent fuzzing work.


Great work with afl. I tried it out last week, and found two segfaults and a stack smash detection in one program. I tried it on another program, only to have gcc crash with an internal error. :(

By parallelized, do you mean on the same machine or across a distributed cluster? If they only share the same set of interesting inputs, won't the different nodes also end up searching much of the same space? .... Hmm, no I see how I could be wrong. With interesting seeds, boring space is easy to re-identify, so there's a trivial amount of duplicate work, and the rest is spent just trying to find something new and interesting.


Recently I find it harder and harder to believe that lcamtuf is just one person.


He just started running afl-fuzz a few years ago and redirects fitting outputs into blog posts.


Running it against an AI simulation of a human programmer, pattern-matching for the output "Wow that's awesome", no doubt.


No kidding. Security work aside, he finds time to take up time-intensive hobbies like CNC milling for robot parts, and has the time to write up comprehensive documents about the hobby?! (http://lcamtuf.coredump.cx/gcnc/)

Maybe he doesn't sleep.


Just in case you're serious, he's been doing security for ages. As well as other interesting things - check out http://lcamtuf.coredump.cx


He does all that and has two young kids. There goes my excuse.


Heads-up to the "comment without reading the article" crowd: the title is not bemoaning a lack of handling for CDATA in existing parsers. It's discussing an interesting behavior of the AFL fuzzer when used with formats that require fixed strings in particular places...

Related: NOBODY EXPECTS THE SPANISH INQUISITION, either. :)


This is completely tangential, but I'm waiting for someone to create a breakfast cereal called Funroll Loops. You know, for the kids.


How long till afl-fuzz reaches consciousness?


About 13 years, if all goes as planned. Then another 5 after that until we are all running afl-fuzz.


Wow, what an enjoyable read. I recommend the story about randomly generating JPG files too.


This thread reminded me of a draft post I've been sitting on for a while, related to ENTITY tags in XML and XXE exploits.

Basically, it's really easy to leave default XML parsing settings (for things like consuming RSS feeds) and accidentally open yourself up to reading files off the filesystem.

I did a full write-up and POC here: http://mikeknoop.com/lxml-xxe-exploit


I'm actually not so surprised, given what the fuzzer does - mutating input to make forward progress in the code. Incremental string comparisons definitely fall under this category since they have a very straightforward definition of "forward progress"; either the byte is correct and we can enter a previously unvisited state, or it's incorrect and execution flows down the unsuccessful path. It's somewhat like the infinite monkey theorem, except the random stream is being filtered such that only a correct subsequence is needed to advance.

On the other hand, I'd be astonished if it managed to fuzz its way through a hash-based comparison (especially one involving crypto like SHA1 or MD5.)
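
A toy illustration of the difference (Python; not from the article): a byte-at-a-time check gives a coverage-guided fuzzer a new branch to discover for every correct prefix byte, while an all-or-nothing hash comparison offers no intermediate signal.

    import hashlib

    MAGIC = b'<![CDATA['

    def parse_incremental(data: bytes) -> bool:
        for i, want in enumerate(MAGIC):
            if i >= len(data) or data[i] != want:
                return False        # a distinct early exit per position
        return True

    def parse_hashed(data: bytes) -> bool:
        # a single opaque check: either the whole prefix matches or it doesn't
        return hashlib.sha1(data[:len(MAGIC)]).digest() == hashlib.sha1(MAGIC).digest()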


It's kind of like breaking a password if you only have to guess 1 letter at a time until you get it right. Reminds me of the Weasel program: https://en.wikipedia.org/wiki/Weasel_program

It's just the simplest possible demonstration of evolution, where characters of a string are randomly changed, and kept if more of the characters match. In a short amount of time you get Shakespeare quotes.

Obviously hashes are designed to be difficult to break. Although I've never heard of anyone trying a method like this before. I've heard of people using things like SAT solvers to try to reason backwards what the solution should be. But this is the reverse, it's trying random solutions and propagating forward to see how far they get.

I doubt it would work, I'm just curious to know if this has been tried before and how well it does.


The problem is that a good, side-channel resistant implementation would always do the same amount of computation and fail at the same place. You wouldn't get any information out of your attempts.


In a way it reminds me of timing attacks only with a lot stronger signal.


Yeah, hashes or even CRC codes would be non-starters... unless the hash or CRC was stored in the input being fuzzed; then it's just a matter of iterating over the hash byte by byte.

Constant-time compares however would probably stump the fuzzer.


It can't. If you download the package you'll see it includes an example of patching PNG, as otherwise the CRC at the end of each block prevents afl from doing much at all.


But of course no one uses either when there's Atom/GitHub's favorite: CSON. https://github.com/bevry/cson


Tell that to the people who created the web service I have to consume!


I didn't expect a kind of Spanish Inquisition...


Maybe C based XML parsers don't, but JVM and .NET based XML parsers don't have any issues with CDATA sections.

Time to upgrade to more modern tools?


The reasons why we are still relying a lot on software written in low-level languages have been discussed to death, and are quite orthogonal to the insight in the article, which is that seemingly lo-tech techniques can discover much about an opaque, potentially vulnerable piece of software. And even some seemingly insurmountable difficulties (“the algorithm wouldn't be able to get past atomic, large-search-space checks such as …”) may simply, with a bit of luck, fail to materialise.

Still, quoting from a sentence a few lines down in the article:

“this particular example is often used as a benchmark - and often the most significant accomplishment - for multiple remarkably complex static analysis or symbolic execution frameworks”

The author is thinking of backwards-propagation static analysis or symbolic execution frameworks, for which it is indeed a feat to reverse-engineer the condition that leads to exploring the possibility that there is a "CDATA" in the input. Forwards-propagation static analysis needs no special trick to assume that the complex condition must be taken some of the time and to visit the statements in that part of the code. The drawback of static analysis (especially with respect to fuzzing) is then the false positives that can result when a condition is only partially understood, or not understood at all.


It's not very relevant to the article, as the author doesn't imply CDATA is poorly supported (and that's not the topic at hand), but CDATA sections are very common in RSS files, as a way to shoehorn text of any type into various elements, so I'd be surprised if any well-used parser lacked support... it's even more of a requirement than namespace support, IMHO.


Have you actually read this post?


Did you read the article?


you didnt even read the article did you


The article... did you read it?


I am not sure but what is the actual harm of it?



