Personally, I have a penchant for writing my own pull parsers. It's a mind-expanding exercise.
The neat thing about Go is that parsers can return functions that consume the next token. Rob Pike has an excellent video about this: http://www.youtube.com/watch?v=HxaD_trXwRE
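For anyone who hasn't watched the talk yet, here is a minimal sketch of the idea, under my own simplified names (a toy lexer that just splits words, with a single lexWords state): each state function returns the next state to run, and the lexer runs in its own goroutine handing tokens to the consumer over a channel. It's an illustration of the shape of the pattern, not the talk's actual code.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    type tokenType int

    const (
        tokenEOF tokenType = iota
        tokenWord
    )

    type token struct {
        typ tokenType
        val string
    }

    // stateFn is a lexer state; it does some work and returns the next state to run.
    type stateFn func(*lexer) stateFn

    type lexer struct {
        input  string
        start  int // start of the token being scanned
        pos    int // current position in the input
        tokens chan token
    }

    func lex(input string) *lexer {
        l := &lexer{input: input, tokens: make(chan token)}
        go l.run() // the lexer runs concurrently; the consumer reads from l.tokens
        return l
    }

    // run drives the state machine until a state function returns nil.
    func (l *lexer) run() {
        for state := lexWords; state != nil; state = state(l) {
        }
        close(l.tokens)
    }

    // emit hands the current token to whoever is reading the channel.
    func (l *lexer) emit(t tokenType) {
        l.tokens <- token{t, l.input[l.start:l.pos]}
        l.start = l.pos
    }

    // lexWords is the only state in this toy example: it emits space-separated words.
    func lexWords(l *lexer) stateFn {
        for l.pos < len(l.input) {
            r, w := utf8.DecodeRuneInString(l.input[l.pos:])
            if r == ' ' {
                if l.pos > l.start {
                    l.emit(tokenWord)
                }
                l.pos += w
                l.start = l.pos
                continue
            }
            l.pos += w
        }
        if l.pos > l.start {
            l.emit(tokenWord)
        }
        l.emit(tokenEOF)
        return nil // no next state: we're done
    }

    func main() {
        l := lex("hello concurrent lexer")
        // The "parser" side: consume tokens as the lexer produces them.
        for tok := range l.tokens {
            if tok.typ == tokenEOF {
                break
            }
            fmt.Printf("word: %q\n", tok.val)
        }
    }

In a real lexer there would be several state functions (inside a string literal, inside a comment, and so on), and returning the next state is what replaces the usual big switch on "what mode am I in".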
Thank you! This is an excellent talk. Having the lexer and parser run concurrently as coroutines communicating over channels is very, very cool.
Streaming parsers are key when dealing with XML files this big. We used to have a C# parser that would parse about 1 TB of XML per day; the biggest files were > 200 GB.
It was impossible without rewriting everything to use a SAX-style parser.
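For anyone doing this in Go, the standard library's encoding/xml supports the same streaming style: xml.Decoder hands back one token at a time, so the whole file never has to sit in memory. A rough sketch (the file name and the "page" element are just placeholders):

    package main

    import (
        "encoding/xml"
        "fmt"
        "io"
        "os"
    )

    func main() {
        f, err := os.Open("huge.xml") // placeholder path
        if err != nil {
            panic(err)
        }
        defer f.Close()

        dec := xml.NewDecoder(f)
        count := 0
        for {
            tok, err := dec.Token() // pulls one token at a time; no DOM is ever built
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                count++
            }
        }
        fmt.Println("pages:", count)
    }

Memory use stays flat no matter how large the input is, which is the whole point of the SAX/streaming approach.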
It happens sometimes. I had written a multithreaded parser in C++ to parse around 800 MB per day so another team could build the rest of the project on top of the data. Someone had thought it would be a better idea to store all the fetched data in XML.
OpenStreetMap is another example that uses huge XML files. I'm not sure I really like the idea, but it does happen, and if you need the data you have to be able to deal with it somehow, even if you don't like the format.
I had a similar task: parsing the huge Wikipedia dump and rewriting its XML (I had to add a couple of extra tags to the main "page" tag). I used a SAX parser in Python to rewrite the dump, and I found SAX parsers very simple for dealing with huge XML streams.
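The same read-and-rewrite loop can be sketched in Go with encoding/xml instead of Python's SAX: decode one token, re-encode it, and inject extra elements right after each opening <page> tag. This is only an outline with made-up names and paths, and the output won't be byte-identical to the input (the encoder re-emits namespace attributes in its own way), but the structure is preserved:

    package main

    import (
        "bufio"
        "encoding/xml"
        "io"
        "os"
    )

    func main() {
        in, err := os.Open("dump.xml") // placeholder input path
        if err != nil {
            panic(err)
        }
        defer in.Close()

        w := bufio.NewWriter(os.Stdout)
        defer w.Flush()

        dec := xml.NewDecoder(in)
        enc := xml.NewEncoder(w)
        defer enc.Flush()

        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            tok = xml.CopyToken(tok) // the decoder may reuse its internal buffers
            if err := enc.EncodeToken(tok); err != nil {
                panic(err)
            }
            // Right after each opening <page>, inject a made-up extra element.
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                extra := xml.StartElement{Name: xml.Name{Local: "extra"}}
                enc.EncodeToken(extra)
                enc.EncodeToken(xml.CharData("added by the rewrite"))
                enc.EncodeToken(extra.End())
            }
        }
    }

Because both the decoder and the encoder work token by token, this handles a multi-gigabyte dump with constant memory, just like the SAX version.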