I'm curious what the motivation behind this project at Google was. It seems to me that the only benefit of writing something like this in pure C would be performance gains over existing parsers, but the README specifically says that parsing performance was not one of the goals.
It actually arose out of a templating language project within Google, which was written in C++. We evaluated the existing C/C++ parsers (which at the time were WebKit, the auto-generated port of validator.nu, and another Google-internal parser - we didn't learn about Hubbub until later), and found that the effort needed to integrate them with our project, and the number of dependencies they would bring into the serving system, precluded us from using them easily. Hixie suggested "Just write your own! It shouldn't be too hard, the algorithm is all specified in the HTML spec" (har, har, famous last words), and Gumbo was born out of naivete and youthful optimism. :-)
There were a bunch of reasons for the choice of C over C++:
1. At the time, we were doing a bunch of stuff with LLVM in the templating language. I'd previously been responsible for trying to integrate LLVM with C++ generated code, and it is painful, mostly because of name mangling and vtable dispatch. Providing a C API sidesteps this entirely, as LLVM can call into C code and use C structs with no problem, and once the API is in C there's little reason to write the internals in C++.
2. We wanted to provide tooling for this templating language, and the easiest way to write tooling is in Python or some other scripting language. It's easier to provide Python etc. bindings with a C API than a C++ API.
3. We'd intended from the start to open-source this. One of the team members had significant open-source experience, and he pointed out that within the open-source community, there are a number of people who basically refuse to use C++ and will instantly disqualify a C++ library. So regardless of whether these people are right, to reach the maximum number of people and prospective projects it should be in C.
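To make point 2 concrete, here is a minimal sketch of why a C API is easy to bind from Python. It does not use Gumbo itself: as a stand-in it calls `strlen` from the system C library through the standard `ctypes` module, and the wrapper name `c_strlen` is made up for illustration. The point is only that an unmangled C symbol with a plain C signature can be loaded and called directly, with no binding generator in between. It assumes a POSIX-like system where the C library can be located.

```python
import ctypes
import ctypes.util

# Locate the C standard library. find_library may return None on some
# systems; on POSIX, CDLL(None) then exposes symbols already loaded into
# the current process (assumption: this is not run on Windows).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
# The symbol name is exactly "strlen" because C does not mangle names.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical wrapper: return the length of a NUL-terminated byte string."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"<html></html>"))
```

A C++ API would need an `extern "C"` shim or a binding generator before something like this could work, because the exported symbol names and object layouts are compiler-specific.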
Most languages can be extended by C code without too much fuss. If the C is written in a platform-independent manner (which should be possible for this library), it's pretty much write-once, run-anywhere. Even if optimal performance wasn't a goal, it's still nice to have a single portable, performant HTML5 parser that's been tested against billions of pages.