I would be quite interested in Gumbo as the backend to the awesome pure Python b...

nostrademons · on Aug 14, 2013

Gumbo's Python wrapper should be a drop-in replacement for html5lib. Just replace

     import html5lib

with

     from gumbo import html5lib

The tree generated from gumbo.html5lib.HTMLParser should be API-compatible with the one generated by html5lib.HTMLParser. (Possibly modulo some minor features; html5lib's maintainer has filed a bug about implementing treewalkers in the html5lib adaptor.)
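A minimal sketch of how the swap could be made optional, so that code falls back to plain html5lib when the Gumbo wrapper is not installed. The helper name `first_available` is made up for illustration, and the call to `HTMLParser().parse(...)` assumes the adaptor mirrors html5lib's interface as described above:

```python
import importlib


def first_available(names):
    """Return the first module in `names` that imports cleanly, else None.

    Hypothetical helper, not part of Gumbo or html5lib.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except (ImportError, OSError):
            # ImportError: the package is not installed.
            # OSError: a C wrapper may fail to load its shared library.
            continue
    return None


# Prefer the Gumbo-backed adaptor, fall back to pure-Python html5lib.
html5lib = first_available(["gumbo.html5lib", "html5lib"])

if html5lib is not None:
    # Assumes both modules expose an html5lib-style HTMLParser class.
    document = html5lib.HTMLParser().parse("<p>Hello <b>world")
    print(type(document))
else:
    print("neither gumbo.html5lib nor html5lib is installed")
```

The rest of the program keeps using the name `html5lib` either way, which is the point of a drop-in replacement.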

I'm not sure offhand what the speed would be - I'd imagine the Gumbo backend would be significantly faster than html5lib by virtue of being written in C, but speed was not a design goal, and so I suspect it's currently significantly slower than lxml. What Gumbo gives you over lxml is HTML5 compatibility - lxml does an HTML4-approximate parse.

gsnedders · on Aug 14, 2013

Well, differences off hand compared with html5lib:

- Byte strings (as opposed to Unicode ones) have their encoding sniffed and are parsed accordingly in html5lib, whereas they're all handled as UTF-8 in Gumbo.

- There's a namespaceHTMLElements option in html5lib which avoids putting HTML elements in the HTML namespace, useful for some legacy HTML processing tools.

- html5lib can read directly from a file object, which might in extreme cases be a useful memory saving (though the parse tree will likely use 100x the amount of memory anyway), but perhaps is more useful when dealing with network streams (it doesn't block waiting for all the data before starting to parse).

- html5lib supports fragment parsing, as is used by innerHTML.
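To make the first difference in the list concrete, here is a rough sketch of why it matters for byte-string input. The sniffing below is a deliberately tiny stand-in (a regex for a `<meta charset>` declaration), not html5lib's actual algorithm, and the function names and the windows-1252 default are assumptions for illustration:

```python
import re

# Hypothetical, heavily simplified stand-in for encoding sniffing:
# look for a <meta ... charset=...> declaration near the start of the input.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.I)


def decode_sniffed(data, default="windows-1252"):
    """Decode bytes using a declared charset if one is found."""
    match = META_CHARSET.search(data[:1024])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # Unknown encoding label: fall back to the default.
        return data.decode(default, errors="replace")


def decode_as_utf8(data):
    """Decode bytes as UTF-8 regardless of any declaration."""
    return data.decode("utf-8", errors="replace")


# A Latin-1 encoded page that declares its own encoding.
page = '<meta charset="iso-8859-1"><p>caf\u00e9</p>'.encode("iso-8859-1")

print(decode_sniffed(page))   # the accented character survives
print(decode_as_utf8(page))   # the accented character becomes U+FFFD
```

In practice this means that, when going through Gumbo, it is safer to decode legacy-encoded bytes yourself before handing the text over.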

Otherwise, given it takes a normal html5lib tree builder, it should support almost everything else (the tree walkers, albeit with indirection from Gumbo's own representation of the tree, and related stuff like the serialiser).

Compared with libxml2, it provides what is likely a better-tested parse algorithm (ultimately, libxml2's is just a few bits of non-fatal error handling in the libxml2 parser plus a few bits of variant behaviour). I know the experience of Hubbub's author was that libxml2's HTML parser had a fair few bad bugs, such as infinite loops, as well as behaviour radically different from any browser and from what most web authors expect to get.

Speed-wise, from a quick try it appears to be a few times quicker than html5lib under PyPy and an order of magnitude quicker under CPython. This will likely differ with the input given.

gsnedders · on Aug 15, 2013

Okay, digging about some more, and actually running Gumbo in its html5lib wrapper, it appears no quicker than html5lib itself (the cost of the tree building dominates the actual parsing). :(

Smerity · on Aug 14, 2013

They already provide adapters for standard Python HTML parsing libraries[1], specifically html5lib and BeautifulSoup. This is how they suggest it be used with Python[2].

[1]: https://github.com/google/gumbo-parser/tree/master/python/gu...

[2]: https://github.com/google/gumbo-parser#python-usage

voltagex_ · on Aug 14, 2013

From a quick glance, it looks like you'd need to make Gumbo a backend for lxml. Is that even possible?

bryanh · on Aug 14, 2013

https://github.com/html5lib/html5lib-python/issues/105 seems to imply that such a thing is possible. I am unsure about the requirement for lxml. I was under the impression that lxml is an optional walker and that the default is the slower pure-Python walker.

gsnedders · on Aug 14, 2013

You're misunderstanding the level at which html5lib operates: it merely parses to a tree (using a "tree builder" to provide a common API to the parser to build the tree, which can be a DOM tree, an ElementTree, an lxml tree, whatever) and provides a generic "tree walker" API that walks over one of those tree formats and provides a common stream of events (start tag, end tag, text, comment, etc.) which can then be used, e.g., in the serialiser.

This can therefore be used with Gumbo by passing the lxml tree builder into its html5lib.parse-like method.
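As a rough illustration of the "tree walker" idea described above, here is a toy walker over a standard-library ElementTree that turns the tree into a flat stream of events, plus a toy serialiser that consumes that stream. This is a sketch of the concept only: the event names and the `walk` and `serialize` functions are invented here and are not html5lib's actual API, and the serialiser does no escaping.

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield (event, value) pairs for an element and its subtree, depth first."""
    yield ("StartTag", element.tag)
    if element.text:
        yield ("Characters", element.text)
    for child in element:
        yield from walk(child)
        # In ElementTree, text following a child lives in the child's tail.
        if child.tail:
            yield ("Characters", child.tail)
    yield ("EndTag", element.tag)


def serialize(events):
    """Rebuild markup from the event stream (no escaping, no attributes)."""
    parts = []
    for event, value in events:
        if event == "StartTag":
            parts.append("<%s>" % value)
        elif event == "EndTag":
            parts.append("</%s>" % value)
        else:
            parts.append(value)
    return "".join(parts)


tree = ET.fromstring("<p>Hello <b>world</b>!</p>")
events = list(walk(tree))
for event in events:
    print(event)
print(serialize(events))
```

The serialiser only ever sees events, so any tree format with its own walker could feed it. That decoupling is what lets a different parser backend slot in underneath.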