I hate XML as much as the next guy but in this case I would blame a poorly designed (or maybe very misused) API.
There's no reason why any file parsing library would end up fetching remote data without being explicitly asked to do so. Actually, it shouldn't even be the library's concern to fetch those files; an XML library has no business doing networking. It's a security concern and a maintenance hell.
This is "XML speak" for an "include" statement to something like cpp, with the exception that this "include" could end up performing remote network fetches to acquire that which is being included.
So, technically, to be a proper, standards-compliant XML parser, the parser has to at least surface the requests to "fetch" these entities to the higher-level code using the library, and let that code decide what to do about the "includes".
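That hand-off is exactly what SAX-style APIs expose. A minimal sketch in Java (the class name and the fake DTD URL are mine, purely for illustration): the parser asks the calling code about every external entity, and the calling code simply refuses to fetch anything.

    import java.io.StringReader;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class NoNetworkParse {
        public static void main(String[] args) throws Exception {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // The parser hands every external entity/DTD request to us instead of
            // deciding on its own; we log it and answer with an empty document,
            // so nothing is ever fetched from the network.
            reader.setEntityResolver((publicId, systemId) -> {
                System.out.println("Parser asked for: " + systemId);
                return new InputSource(new StringReader(""));
            });

            reader.parse(new InputSource(new StringReader(
                    "<!DOCTYPE doc SYSTEM \"http://example.com/some.dtd\"><doc/>")));
        }
    }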
As to why Microsoft's implementation is the way it is, absent a Raymond Chen blog post explaining the why, we can only guess.
Ever tried looking inside the documents saved by LibreOffice or MS Office? They all use XML now. An ODT document with only two words in it has, at the start of the content.xml inside the ODT, this beauty:
The problem is, whenever you have some link anywhere and it's assumed that it should be refreshed sometimes, how can you know that you shouldn't load the more current version? If you're writing something like a DLL or a library, why not leave it to the expert: let IE try to fetch it, and if it has already fetched it, it will return it from its own cache! Brilliant, problem solved! Except when that happens from 144 instances all the time, and IE needs to create some windows at startup, which is what Bruce seems to have managed to trigger.
"Given that namespaces have definitive material, and that such definitive material is typically available on the Web, and that namespace names may be "http:"-class URIs, it is a grievous waste of potential if it is not possible to use the namespace name in retrieving the definitive material."
And in order to do all the processing and transformations popular at the time, copies of the documents specified by those URIs have to exist somewhere. Bruce detected some loads of documents that are stored locally, inside the DLLs.
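Shipping local copies and mapping the well-known URIs onto them is, in spec terms, what XML catalogs are for. This isn't what MSXML does internally, but the same idea expressed with the JDK's javax.xml.catalog API, as a rough sketch (the catalog path and DTD URL are made up):

    import java.net.URI;
    import javax.xml.catalog.CatalogFeatures;
    import javax.xml.catalog.CatalogManager;
    import javax.xml.catalog.CatalogResolver;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.XMLReader;

    public class CatalogBackedParse {
        public static void main(String[] args) throws Exception {
            // catalog.xml (hypothetical) maps remote identifiers to files on disk, e.g.
            //   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
            //     <system systemId="http://example.com/some.dtd" uri="dtds/some.dtd"/>
            //   </catalog>
            CatalogResolver resolver = CatalogManager.catalogResolver(
                    CatalogFeatures.defaults(), URI.create("file:///opt/app/catalog.xml"));

            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Every DTD/entity "fetch" now goes through the catalog, so the
            // well-known documents are served from local copies, never the network.
            reader.setEntityResolver(resolver);
        }
    }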
The number of times I've fixed bugs as a direct consequence of this is simply astounding.
Two favourites:
1) App never started because it couldn't access the internet to fetch a DTD/XSD.
2) Sun/Oracle removed XSDs and the app refused to start.
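The usual fix for bug 1) is to tell the parser up front that external DTDs and schemas are off limits. A rough sketch with plain JAXP (the settings below are the standard JAXP 1.5 properties and SAX feature names; how a particular framework exposes them may differ):

    import javax.xml.XMLConstants;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class OfflineXmlFactory {
        public static DocumentBuilderFactory newOfflineFactory() throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

            // Forbid resolving any external DTDs or schemas, so a dead network
            // (or an XSD that vanished from someone else's server) can't block startup.
            dbf.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            dbf.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");

            // Belt and braces: also switch off external entity expansion.
            dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            return dbf;
        }
    }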
> There's no reason why any file parsing library would end up fetching remote data
It's not the API; it's part of many of the specs, as this was once thought to be a good idea. Specifically, many standardization efforts involved additional definitions located on HTTP servers. One example:
Which was "poised to play a central role in the future of XML processing, especially in Web services where it serves as one of the fundamental pillars that higher levels of abstraction are built upon."
So you had to implement it to be "conforming," and then add "optimizations" to avoid the overheads. Ironically, the "/optimize" feature isn't itself optimized.
The reason this wasn't discovered earlier is that the Visual Studio editions which contain that option were priced at thousands of dollars (I don't know what the cheapest version currently containing "/optimize" is -- does anybody know?).