I would be quite interested in Gumbo as the backend to the awesome pure Python but otherwise rather-slow https://github.com/html5lib/html5lib-python, which actually has great whitelisting/cleaning facilities but is easily an order of magnitude slower than lxml's more limited clean_html.
html5lib under PyPy's JIT is about 8x faster than it is under CPython.
Gumbo's Python wrapper should be a drop-in replacement for html5lib. Just replace
import html5lib
with
from gumbo import html5lib
The tree generated from gumbo.html5lib.HTMLParser should be API-compatible with the one generated by html5lib.HTMLParser. (Possibly modulo some minor features...html5lib's maintainer has filed a bug about implementing treewalkers in the html5lib adaptor.)
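For what it's worth, a minimal sketch of how calling code could stay agnostic about which of the two is installed. The module path gumbo.html5lib is taken from the import line above; the helper name load_html5lib_api is made up for illustration, and nothing here checks that the two modules really behave identically.

```python
import importlib


def load_html5lib_api(candidates=("gumbo.html5lib", "html5lib")):
    """Return the first importable module from `candidates`, or None.

    Hypothetical helper: the default order prefers the Gumbo-backed
    adapter and falls back to pure-Python html5lib.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or its own dependencies are missing); try the next one.
            continue
    return None


parser_api = load_html5lib_api()
if parser_api is not None:
    print("using", parser_api.__name__)
else:
    print("neither gumbo nor html5lib is installed")
```

The rest of the program would then go through parser_api only, so swapping backends stays a one-line change.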
I'm not sure offhand what the speed would be - I'd imagine the Gumbo backend would be significantly faster than html5lib by virtue of being written in C, but speed was not a design goal, and so I suspect it's currently significantly slower than lxml. What Gumbo gives you over lxml is HTML5 compatibility - lxml does an HTML4-approximate parse.
Well, differences off hand compared with html5lib:
- Byte strings (as opposed to Unicode ones) have their encoding sniffed and are decoded accordingly in html5lib, whereas Gumbo handles them all as UTF-8.
- There's a namespaceHTMLElements option in html5lib which avoids putting HTML elements in the HTML namespace, useful for some legacy HTML processing tools.
- html5lib can read directly from a file object, which might in extreme cases be a useful memory saving (though the parse tree will likely use 100x the amount of memory anyway), but perhaps is more useful when dealing with network streams (it doesn't block waiting for all the data before starting to parse).
- html5lib supports fragment parsing, as is used by innerHTML.
Otherwise, given it takes a normal html5lib tree builder, it should support almost everything else (the tree walkers, albeit with indirection from Gumbo's own representation of the tree, and related stuff like the serialiser).
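To make the byte-string point concrete, here is a rough stdlib-only sketch of why it matters. The sniff_and_decode helper is hypothetical and far cruder than real encoding sniffing (it only looks for a meta charset declaration in the first 1024 bytes), but it shows how the same bytes come out differently when everything is assumed to be UTF-8.

```python
import re

# Crude stand-in for encoding sniffing: find a <meta ... charset=...> declaration.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.I)


def sniff_and_decode(data, fallback="utf-8"):
    """Decode bytes using a declared charset if one is found near the start."""
    match = META_CHARSET.search(data[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The declared encoding name is unknown to Python; use the fallback.
        return data.decode(fallback, errors="replace")


def decode_as_utf8(data):
    """Treat every byte string as UTF-8, whatever the document declares."""
    return data.decode("utf-8", errors="replace")


page = '<meta charset="iso-8859-1"><p>café</p>'.encode("iso-8859-1")
print(sniff_and_decode(page))  # the accented character survives
print(decode_as_utf8(page))    # the accented character becomes U+FFFD
```

If you already know the encoding, decoding to a Unicode string yourself before parsing sidesteps the difference entirely.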
Compared with libxml2, it provides what is likely a better tested parse algorithm (ultimately, libxml2's is just a few bits of non-fatal error handling in the libxml2 parser with a few bits of variant behaviour). I know the experience of HubHub's author was that it had a fair few bad bugs like infinite loops and the like, as well as radically different behaviour to any browser and to what most web authors expect to get.
Speed-wise, from quickly trying it, it appears to be a few times quicker than html5lib under PyPy and an order of magnitude quicker under CPython. This will likely differ with the input given.
Okay, digging about some more, and actually running Gumbo in its html5lib wrapper, it appears no quicker than html5lib itself (the cost of the tree building dominates the actual parsing). :(
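If anyone wants to check the tokenising versus tree-building split for themselves, this is roughly how one might measure it. It is only a sketch using the stdlib html.parser and ElementTree, not Gumbo or html5lib, so the numbers say nothing about either; the class and function names are made up for illustration.

```python
import timeit
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Tokenise only: count start tags and build nothing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


class TreeParser(HTMLParser):
    """Tokenise and also build an ElementTree via TreeBuilder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, dict(attrs))

    def handle_endtag(self, tag):
        self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


# Well-formed synthetic input; real pages will behave differently.
DOC = "<html><body>" + "<p class='x'>hello <b>world</b></p>" * 200 + "</body></html>"


def tokenise_only(doc):
    parser = CountingParser()
    parser.feed(doc)
    parser.close()
    return parser.count


def build_tree(doc):
    parser = TreeParser()
    parser.feed(doc)
    parser.close()
    return parser.builder.close()  # root element of the finished tree


print("tokenise only:", timeit.timeit(lambda: tokenise_only(DOC), number=20))
print("with tree:    ", timeit.timeit(lambda: build_tree(DOC), number=20))
```

The gap between the two timings is the cost attributable to building the tree in Python, which is the part a C tokeniser cannot speed up.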
They already provide adapters for standard Python HTML parsing libraries[1], specifically html5lib and BeautifulSoup. This is how they suggest it be used with Python[2].
https://github.com/html5lib/html5lib-python/issues/105 seems to imply that such a thing is possible. I am unsure about the requirement for lxml. I was under the impression that lxml is an optional walker and that the default is the slower pure-Python walker.
You're misunderstanding the level at which html5lib operates: it merely parses to a tree (using a "tree builder" to provide a common API to the parser to build the tree, which can be a DOM tree, an ElementTree, an lxml tree, whatever) and provides a generic "tree walker" API that walks over one of those tree formats and provides a common stream of events (start tag, end tag, text, comment, etc.) which can then be used, e.g., in the serialiser.
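A toy version of that split, in case it helps: one function walks a tree and emits a flat stream of events, and a separate serialiser consumes the stream without knowing what kind of tree it came from. This is a sketch over the stdlib ElementTree with made-up event names, not html5lib's actual walker or token format.

```python
import xml.etree.ElementTree as ET
from html import escape


def walk(element):
    """Yield (kind, value) events for an ElementTree element, depth first."""
    yield ("start", (element.tag, dict(element.attrib)))
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield from walk(child)
        if child.tail:
            # Text that follows the child's end tag belongs to the parent.
            yield ("text", child.tail)
    yield ("end", element.tag)


def serialise(events):
    """Turn an event stream back into markup; knows nothing about trees."""
    parts = []
    for kind, value in events:
        if kind == "start":
            tag, attrs = value
            attr_text = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
            parts.append(f"<{tag}{attr_text}>")
        elif kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "end":
            parts.append(f"</{value}>")
    return "".join(parts)


tree = ET.fromstring('<p class="note">Hello <b>world</b>!</p>')
print(serialise(walk(tree)))
```

Supporting another tree format then only means writing another walk function that emits the same events; the serialiser stays untouched.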
This can therefore be used with Gumbo by passing the lxml tree builder into its html5lib.parse-like method.