1. Hubbub provides a SAX-style (callback-based) API, while Gumbo gives you a DOM-style (tree-based) struct directly. Hubbub is likely faster in this regard; Gumbo is easier to use out-of-the-box.
2. Gumbo is better tested. It's unclear whether Hubbub's 90% test coverage means "90% of the code is tested" or "90% of the tests pass", but Gumbo has 100% code coverage, passes 100% of the html5lib tests (as of 0.95; the html5lib maintainer has pointed out that additional tests were added to trunk recently that don't pass), and has been run without crashing on ~4.5B documents from Google's index.
3. Gumbo has better support for source locations and going between original text and parse tree.
4. Hubbub has character encoding detection, Gumbo doesn't.
It's an XML parser that is neither DOM nor SAX. I haven't seen much mention of it before, except as a recommendation for Java devs. There's a C version too. It makes bold claims about performance.
"Comparing with DOM, VTD-XML is significantly faster (up to 10x), more memory-efficient (up to 5x).
Comparing with SAX/PULL, VTD-XML is not only faster, but also is capable of random-access, therefore is easier to use."
Basically by building a DOM style model in SAX fashion.
Interesting. I'm not familiar with the library. I'm familiar with the general programming model of parsing a document into a number of tokens and then encoding document structure into offsets between tokens.
It wouldn't have worked for Gumbo's purposes because:
1. Gumbo captures a lot more information than can fit in a 64-bit token. For example, Gumbo decodes entity references; this requires that text be available in a fresh buffer because each individual character might be something different than the source text.
2. One of Gumbo's goals was to make it easy to write bindings in other languages. Most languages can bind to C structs easily, but binding to C function calls often requires a much more verbose preamble to set up args, return types, conversions, etc. (I was actually thinking of LLVM when I designed Gumbo's API, since the project it was initially for was looking at LLVM as an embedded JIT at the time. Binding to a struct that's C-formatted just requires defining a new type, but binding to a function call requires codegenning a lot of argument setup.)
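To make point 1 concrete, here is a small sketch in Python (not Gumbo's code; the sample markup is made up) of why decoded text can't be described as an offset and length into the original buffer. It uses the standard library's `html.unescape` as a stand-in for an HTML parser's entity decoding.

```python
import html

# Made-up input; any text containing entity references behaves the same way.
source = "<p>Fish &amp; chips &#169; 2013</p>"

# An offset-based token can describe the raw text as a (start, end) pair.
start = source.index(">") + 1
end = source.index("</p>")
raw_text = source[start:end]

# Decoding entities yields a different, shorter string.
decoded = html.unescape(raw_text)

# The decoded string does not occur anywhere in the source buffer,
# so no (offset, length) pair can point at it; it needs its own storage.
needs_fresh_buffer = decoded not in source
```

An offset-only design has to either decode lazily on every access or keep a side table of decoded strings, which is roughly the "caches" trade-off mentioned further down the thread.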
It's a shame, I wish vtd-xml was a more popular library, so I could read more about it rather than have to do it myself. libxml2 seems to rule the roost. vtd-xml doesn't have a Debian package and the C files gave a lot of warnings when I compiled. I don't know enough about its performance to say if the bold claims are true. The author says the Java version is a little faster than the C version, which strikes me as odd - I wonder whether he is basing that on long-duration benchmarks.
I wasn't suggesting that you should have used the approach; I was wondering whether you had. I've learned a little bit about the limits of this tokenising parser method, thanks for your reply.
EDIT: Badgar thanks for your comment, I'll search out your lecturer's work if I ever have to parse something. Shame your account seems banned or something. I looked through your history, and it was over saying you had trouble quitting weed or something stupid.
FWIW my house mate kicked his weed addiction by cutting out triggers: people, places and things that would encourage him to light one up. He had all the problems with it you list. He had to stop drinking for a while to have sufficient willpower. He resumed drinking after successfully kicking weed.
My undergrad thesis advisor, Bill McKeeman, wrote his parsers in this fashion. I implemented a parser using this model and extended its existing lexer.
The token stream is an array of 32-bit integers, each of which is the token type bitmasked onto an index into the input file of the start of the token. If you need the token text, you re-lex from that index. Caches can be implemented as small hash maps from token index to cached value.
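A rough sketch of that token layout in Python, to show the idea rather than the original implementation. The 8/24 split between type bits and offset bits, the three token kinds, and the regex are all assumptions made for illustration; the comment above doesn't say how the bits were divided.

```python
import re
from array import array

# Assumed split: top 8 bits hold the token type, low 24 bits the start offset.
TYPE_BITS = 8
OFFSET_BITS = 32 - TYPE_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1

# Hypothetical token kinds for a toy language.
IDENT, NUMBER, PUNCT = 1, 2, 3
KIND = {"ident": IDENT, "number": NUMBER, "punct": PUNCT}

TOKEN_RE = re.compile(
    r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<punct>[^\s\w]))"
)

def pack(kind, offset):
    """Combine a token kind and a start offset into one integer."""
    if offset > OFFSET_MASK:
        raise ValueError("input too large for the assumed 24-bit offset")
    return (kind << OFFSET_BITS) | offset

def unpack(token):
    """Split a packed token back into (kind, offset)."""
    return token >> OFFSET_BITS, token & OFFSET_MASK

def lex(text):
    """Return the token stream as a flat array of unsigned integers."""
    tokens = array("I")
    pos = 0
    while True:
        m = TOKEN_RE.match(text, pos)
        if m is None:  # end of input, or a character this toy lexer skips
            break
        tokens.append(pack(KIND[m.lastgroup], m.start(m.lastgroup)))
        pos = m.end()
    return tokens

def token_text(text, token):
    """Recover the token text by re-lexing at the stored offset."""
    _, offset = unpack(token)
    m = TOKEN_RE.match(text, offset)
    return m.group(m.lastgroup)
```

For example, `[token_text("x = 42;", t) for t in lex("x = 42;")]` recovers the four token strings without the stream ever having stored them. Token length isn't stored either, which is why the text has to be re-lexed on demand.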
The canonical AST is the CFG parse tree with fixups to convert recursion to children of a node type for a variable number of children. It is stored as an array of integers. The node is a sequence of integers, with one integer for the root followed by each child in sequence. Each internal node's value is the index of the CFG rule evaluated to produce the node, and the children of the node correspond to the CFG rule's right hand side (minus keywords). Terminals are stored as integer indexes into the token stream. Nonterminals are the integer index of the child internal node.
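Here is how I read that layout, as a minimal Python sketch. The two-rule grammar, the `RULES` table, and the `evaluate` helper are all hypothetical. Whether a child slot holds a token index or a node index is decided by looking up the rule, which is my assumption about how the two are told apart. The variable-arity fixup nodes are left out.

```python
from array import array

# Hypothetical grammar. Each rule lists its child slots after keywords and
# punctuation are dropped: "T" = terminal (index into the token stream),
# "N" = nonterminal (index of the child node within the AST array).
RULES = [
    ("Sum", "NN"),  # Expr -> Expr '+' Expr   (the '+' is not stored)
    ("Num", "T"),   # Expr -> NUMBER
]
SUM, NUM = 0, 1

def evaluate(ast, node, tokens):
    """Evaluate the node starting at index `node` of the flat AST array."""
    name, shape = RULES[ast[node]]
    children = ast[node + 1 : node + 1 + len(shape)]
    if name == "Num":
        return int(tokens[children[0]])
    if name == "Sum":
        left, right = children
        return evaluate(ast, left, tokens) + evaluate(ast, right, tokens)
    raise ValueError(f"unknown rule {name}")

# Token texts for "1 + 2 + 3", standing in for the real token stream.
tokens = ["1", "+", "2", "+", "3"]

# Flat AST for (1 + 2) + 3. Each node is its rule index followed by children.
ast = array("I", [
    NUM, 0,      # index 0: Num(token 0)
    NUM, 2,      # index 2: Num(token 2)
    SUM, 0, 2,   # index 4: Sum(node 0, node 2)
    NUM, 4,      # index 7: Num(token 4)
    SUM, 4, 7,   # index 9: Sum(node 4, node 7), the root
])

total = evaluate(ast, 9, tokens)
```

The whole tree is one contiguous array with no per-node allocations or pointers, which is presumably a large part of why the approach is so fast.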
Bill has been a big name in compilers for the better part of 5 decades now, and he said he's been using this pattern for almost as long. It's ridiculously fast, which is why he used it decades ago for DEC.