Parsing huge XML files with Go (davidsingleton.org)
44 points by dps on June 19, 2012 | 17 comments



Personally, I have a penchant for writing my own pull parsers. It's a mind-expanding exercise.

The neat thing about Go is that parsers can return functions that consume the next token. Rob Pike has an excellent video about this: http://www.youtube.com/watch?v=HxaD_trXwRE
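
For anyone who hasn't watched the talk, here is a minimal sketch of the state-function pattern Pike describes (the lexer/item names below are illustrative, not from the talk or the article): each state is a function that emits tokens on a channel and returns the next state, so the lexer and its consumer can run concurrently.

    package main

    import "fmt"

    // item is a token emitted by the lexer.
    type item struct {
        typ string
        val string
    }

    // stateFn is the core of the pattern: a state does some work
    // and returns the next state (or nil to stop).
    type stateFn func(*lexer) stateFn

    type lexer struct {
        input string
        pos   int
        items chan item
    }

    // lexText emits one character per token; a real lexer would branch
    // into different state functions here.
    func lexText(l *lexer) stateFn {
        if l.pos >= len(l.input) {
            l.items <- item{typ: "EOF"}
            return nil
        }
        l.items <- item{typ: "char", val: string(l.input[l.pos])}
        l.pos++
        return lexText
    }

    // run drives the state machine, typically in its own goroutine.
    func (l *lexer) run() {
        for state := stateFn(lexText); state != nil; state = state(l) {
        }
        close(l.items)
    }

    func main() {
        l := &lexer{input: "go", items: make(chan item)}
        go l.run()
        for it := range l.items {
            fmt.Println(it.typ, it.val)
        }
    }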


> Rob Pike has an excellent video about this: http://www.youtube.com/watch?v=HxaD_trXwRE

Thank you! This is an excellent talk. Having a concurrent implementation of the lexer & parser as coroutines communicating over message channels is very, very cool.


As in any language that supports functions as first-class objects.


Streaming parsers are key when dealing with XML files this big. I used to have a C# parser that would parse about 1 TB of XML per day; the biggest files were > 200 GB.

It was impossible without rewriting everything to use a SAX-style parser.


SAX style (parsing library callbacks) is not your only option; you can also use an iterator style (i.e. something like XmlReader in C#).
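
For the Go thread at hand, encoding/xml's Decoder gives you exactly this kind of pull/iterator interface. A rough sketch (the document and element names are made up for illustration):

    package main

    import (
        "encoding/xml"
        "fmt"
        "strings"
    )

    func main() {
        doc := `<users><user id="1">alice</user><user id="2">bob</user></users>`
        dec := xml.NewDecoder(strings.NewReader(doc))
        for {
            // Pull the next token on demand instead of receiving callbacks.
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF once the stream is exhausted
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "user" {
                fmt.Println("user element with", len(se.Attr), "attribute(s)")
            }
        }
    }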


Oops, I mistook SAX for an iterator style; I really prefer XmlReader to SAX style.


> you can also use an iterator style (i.e. something like XmlReader in C#).

This category is generally called "pull parsers" (as opposed to SAX-style event-driven parsers).


I'm curious, what data did the XML files contain?


As much as I like hearing about Go, SAX parsers are not exactly new.


I think showing the convenience of parsing into tagged structs makes this a cut above the usual SAX parsing examples.
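
Agreed. For readers who haven't clicked through, the approach looks roughly like this (a sketch rather than the article's exact code; I'm assuming Wikipedia-dump-style <page> elements): stream tokens with xml.Decoder and hand each interesting start element to DecodeElement, which fills a struct via its xml tags.

    package main

    import (
        "encoding/xml"
        "fmt"
        "strings"
    )

    // Page is populated directly from the XML via struct tags.
    type Page struct {
        Title string `xml:"title"`
        Text  string `xml:"text"`
    }

    func main() {
        doc := `<dump><page><title>Go</title><text>A language</text></page></dump>`
        dec := xml.NewDecoder(strings.NewReader(doc))
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF at the end of the stream
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                var p Page
                // DecodeElement consumes just this element's subtree,
                // so memory use stays bounded by one page at a time.
                if err := dec.DecodeElement(&p, &se); err != nil {
                    break
                }
                fmt.Printf("%+v\n", p)
            }
        }
    }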


Hm, good point. Negativity redacted.


In the first place, why would you have huge XML files? (Except those Wikipedia dump files :))


It happens sometimes. I had written a multithreaded parser in C++ to parse around 800 MB per day so another team could build the rest of the project on top of the data. Someone had thought it would be a good idea to store all the fetched data in XML.


XML is often used when migrating datasets, large and small. Interchange between disparate systems is the very thing it's good for.


OpenStreetMap is another example that uses huge XML files. I'm not sure I really like the idea, but it does happen, and if you need the data then you have to be able to deal with it somehow even if you don't like the format.


I had to deal with relatively large XML dumps containing dictionary data.


I had to do a similar task: parsing the huge Wikipedia dump and rewriting the XML (I had to add a couple of other tags to the main "page" tag). I used a SAX parser in Python and rewrote the dump. I found SAX parsers very simple for dealing with huge XML streams.
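
Not the Python SAX approach described above, but for the Go-minded: the same kind of streaming rewrite can be sketched with encoding/xml by copying tokens from a Decoder to an Encoder and injecting extra elements after each <page> start tag (the element names here are illustrative, and error handling is omitted for brevity).

    package main

    import (
        "encoding/xml"
        "os"
        "strings"
    )

    func main() {
        in := `<dump><page><title>Go</title></page></dump>`
        dec := xml.NewDecoder(strings.NewReader(in))
        enc := xml.NewEncoder(os.Stdout)
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF at the end of the input
            }
            enc.EncodeToken(tok) // copy the original token through
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                // Inject an extra <redirect>false</redirect> child element.
                extra := xml.StartElement{Name: xml.Name{Local: "redirect"}}
                enc.EncodeToken(extra)
                enc.EncodeToken(xml.CharData("false"))
                enc.EncodeToken(extra.End())
            }
        }
        enc.Flush()
    }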



