Parsing huge XML files with Go (davidsingleton.org)
44 points by dps on June 19, 2012 | 17 comments



Personally, I have a penchant for writing my own pull parsers. It's a mind-expanding exercise.

The neat thing about Go is that parsers can return functions that consume the next token. Rob Pike has an excellent video about this: http://www.youtube.com/watch?v=HxaD_trXwRE
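
For anyone who hasn't watched the talk, here is a minimal sketch of the state-function pattern Pike describes (the lexer/item names below are illustrative, not from the talk or the article): each state is a function that emits tokens on a channel and returns the next state, so the lexer and its consumer can run concurrently.

    package main

    import "fmt"

    // item is a token emitted by the lexer.
    type item struct {
        typ string
        val string
    }

    // stateFn is the core of the pattern: a state does some work
    // and returns the next state (or nil to stop).
    type stateFn func(*lexer) stateFn

    type lexer struct {
        input string
        pos   int
        items chan item
    }

    // lexText emits one character per token; a real lexer would branch
    // into different state functions here.
    func lexText(l *lexer) stateFn {
        if l.pos >= len(l.input) {
            l.items <- item{typ: "EOF"}
            return nil
        }
        l.items <- item{typ: "char", val: string(l.input[l.pos])}
        l.pos++
        return lexText
    }

    // run drives the state machine, typically in its own goroutine.
    func (l *lexer) run() {
        for state := stateFn(lexText); state != nil; state = state(l) {
        }
        close(l.items)
    }

    func main() {
        l := &lexer{input: "go", items: make(chan item)}
        go l.run()
        for it := range l.items {
            fmt.Println(it.typ, it.val)
        }
    }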


> Rob Pike has an excellent video about this: http://www.youtube.com/watch?v=HxaD_trXwRE

Thank you! This is an excellent talk. Having a concurrent implementation of the lexer & parser as coroutines communicating over message channels is very, very cool.


As in any language that supports functions as first-class objects.


Streaming parsers are key when dealing with XML files this big. I used to have a C# parser that would parse about 1 TB of XML per day; the biggest files were > 200 GB.

It was impossible without rewriting everything to use a SAX-style parser.


SAX style (parsing library callbacks) is not your only option; you can also use an iterator style (i.e. something like XmlReader in C#).
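
For the Go thread at hand, encoding/xml's Decoder gives you exactly this kind of pull/iterator interface. A rough sketch (the document and element names are made up for illustration):

    package main

    import (
        "encoding/xml"
        "fmt"
        "strings"
    )

    func main() {
        doc := `<users><user id="1">alice</user><user id="2">bob</user></users>`
        dec := xml.NewDecoder(strings.NewReader(doc))
        for {
            // Pull the next token on demand instead of receiving callbacks.
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF once the stream is exhausted
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "user" {
                fmt.Println("user element with", len(se.Attr), "attribute(s)")
            }
        }
    }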


Oops, I mistook SAX for an iterator style; I really prefer XmlReader to SAX style.


> you can also use an iterator style (i.e. something like XmlReader in C#).

This category is generally called "pull parsers" (as opposed to SAX-style event-driven parsers).


I'm curious, what data did the XML files contain?


As much as I like hearing about Go, SAX parsers are not exactly new.


I think showing the convenience of parsing into tagged structs makes this a cut above the usual SAX parsing examples.
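
Agreed. For readers who haven't clicked through, the approach looks roughly like this (a sketch rather than the article's exact code; I'm assuming Wikipedia-dump-style <page> elements): stream tokens with xml.Decoder and hand each interesting start element to DecodeElement, which fills a struct via its xml tags.

    package main

    import (
        "encoding/xml"
        "fmt"
        "strings"
    )

    // Page is populated directly from the XML via struct tags.
    type Page struct {
        Title string `xml:"title"`
        Text  string `xml:"text"`
    }

    func main() {
        doc := `<dump><page><title>Go</title><text>A language</text></page></dump>`
        dec := xml.NewDecoder(strings.NewReader(doc))
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF at the end of the stream
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                var p Page
                // DecodeElement consumes just this element's subtree,
                // so memory use stays bounded by one page at a time.
                if err := dec.DecodeElement(&p, &se); err != nil {
                    break
                }
                fmt.Printf("%+v\n", p)
            }
        }
    }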


Hm, good point. Negativity redacted.


In the first place, why would you have huge XML files? (Except those Wikipedia dump files :))


It happens sometimes. I had written a multithreaded parser in C++ to parse around 800 MB per day so another team could build the rest of the project on top of the data. Someone had thought it would be a good idea to store all the fetched data in XML.


XML is often used when migrating datasets, large and small. Interchange between disparate systems is the very thing it's good for.


OpenStreetMap is another example that uses huge XML files. I'm not sure I really like the idea, but it does happen, and if you need the data then you have to be able to deal with it somehow even if you don't like the format.


I had to deal with relatively large XML dumps containing dictionary data.


I had to do a similar task: parsing the huge Wikipedia dump and rewriting the XML (I had to add a couple of other tags to the main "page" tag). I used a SAX parser in Python and rewrote the dump. I found SAX parsers very simple for dealing with huge XML streams.
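
Not the Python SAX approach described above, but for the Go-minded: the same kind of streaming rewrite can be sketched with encoding/xml by copying tokens from a Decoder to an Encoder and injecting extra elements after each <page> start tag (the element names here are illustrative, and error handling is omitted for brevity).

    package main

    import (
        "encoding/xml"
        "os"
        "strings"
    )

    func main() {
        in := `<dump><page><title>Go</title></page></dump>`
        dec := xml.NewDecoder(strings.NewReader(in))
        enc := xml.NewEncoder(os.Stdout)
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF at the end of the input
            }
            enc.EncodeToken(tok) // copy the original token through
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                // Inject an extra <redirect>false</redirect> child element.
                extra := xml.StartElement{Name: xml.Name{Local: "redirect"}}
                enc.EncodeToken(extra)
                enc.EncodeToken(xml.CharData("false"))
                enc.EncodeToken(extra.End())
            }
        }
        enc.Flush()
    }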



