Show HN: Python bindings to the Servo HTML5 parser, html5ever (github.com/tbodt)
146 points by tbodt on June 19, 2017 | 14 comments



The html5ever parser source [1] is remarkably easy to read, since it uses the Rust macro system to represent the state transitions declaratively. It also uses pattern matching to nice effect.

[1]: https://github.com/servo/html5ever/blob/master/html5ever/src...


Perhaps a better comparison is this: https://github.com/servo/html5ever/blob/master/html5ever/src...

https://html.spec.whatwg.org/multipage/parsing.html#data-sta...

Where the spec text:

      U+0000 NULL ->
      This is an unexpected-null-character parse error. Emit the current input character as a character token.
translates into:

      FromSet('\0') => go!(self: error; emit '\0'),
If you ignore `FromSet`, which is used to match a small set of characters, and `go!`, which is a macro, you get something akin to:

     '\0' => emit_error; emit '\0'
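To make the shape of this concrete, here is a minimal sketch of the same idea in a toy tokenizer. It is not html5ever's code: `Token`, `State`, `Tokenizer`, `tokenize`, and this cut-down `go!` macro are all hypothetical names invented for illustration, and the real macro supports many more actions. The only point is that a small macro lets each match arm read like the corresponding sentence in the spec.

```rust
// Toy illustration only: every name below is hypothetical, not html5ever's API.

#[derive(Debug, PartialEq)]
enum Token {
    Character(char),
    ParseError,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Data,
    TagOpen,
}

struct Tokenizer {
    state: State,
    tokens: Vec<Token>,
}

// A cut-down stand-in for a `go!`-style macro: a list of actions
// separated by `;`, expanded into plain method calls and assignments.
macro_rules! go {
    // `error; rest...` records a parse error, then runs the remaining actions.
    ($me:ident : error ; $($rest:tt)*) => {{
        $me.tokens.push(Token::ParseError);
        go!($me : $($rest)*);
    }};
    // `emit c` (final action) emits a character token.
    ($me:ident : emit $c:expr) => {{
        $me.tokens.push(Token::Character($c));
    }};
    // `to State` (final action) switches to another state.
    ($me:ident : to $s:ident) => {{
        $me.state = State::$s;
    }};
}

impl Tokenizer {
    fn new() -> Self {
        Tokenizer { state: State::Data, tokens: Vec::new() }
    }

    // Consume one input character according to the current state.
    fn step(&mut self, c: char) {
        match self.state {
            State::Data => match c {
                // "U+0000 NULL: parse error; emit the current input character."
                '\0' => go!(self: error; emit '\0'),
                '<' => go!(self: to TagOpen),
                c => go!(self: emit c),
            },
            // The toy tag-open state swallows one character and returns to Data.
            State::TagOpen => go!(self: to Data),
        }
    }
}

// Run the toy tokenizer over a whole string and collect the tokens.
fn tokenize(input: &str) -> Vec<Token> {
    let mut t = Tokenizer::new();
    for c in input.chars() {
        t.step(c);
    }
    t.tokens
}
```

Calling `tokenize("a\0")` on this sketch yields a character token for `a`, then a parse error, then a character token for the NULL, which mirrors the spec sentence quoted above.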


I would hope it would be nice to read since one could argue Rust was designed for the purpose of building Servo. So, if you can't implement Servo nicely in Rust, it'd be a pretty bad design.


I think you've misunderstood what pcwalton was trying to say. This library is interesting because (via macros) it creates a DSL that attempts to emulate how the HTML 5 spec is written, in order to more easily verify the correctness of the implementation. Note that, by dint of much of HTML being an accident of history, the HTML 5 spec is somewhat imposing; it's not going to be a cakewalk in any language, and at the same time it's bespoke enough that it would be silly to design your language to cater to the needs of the HTML 5 spec in particular. The fact that Rust can create DSLs via macros does help here, though I wouldn't recommend this approach for anything other than a similarly extreme case. In fact, I'd say this library has the most extensive macro use of any production Rust code I've ever seen; it's quite atypical as far as Rust code goes.


If Rust were solely designed for the purpose of building Servo, we'd have OO - Servo quite badly wants it for the DOM - and a whole bunch of other one-off features. In practice, Rust is a language built for systems development in general with Servo having been an early testbed to make sure it's going in the right direction.


It might be worth linking to the talk that discusses html5ever's use of macros? I'm having trouble finding it.




This is exciting. It's using Cython [1]

To the author:

How do you feel about binding Python to Rust? Did you use any tutorials?

[1] https://github.com/tbodt/htmlpyever/blob/880da57/setup.py#L5


Seems to be using lxml's C API for treebuilding; I wonder how that compares (perf-wise, primarily) to using libxml2 directly and then calling adoptExternalDocument?


I did not know adoptExternalDocument existed...


Heh, okay, so it wasn't a deliberate decision!


This sounds useful. It should parse the same way Firefox does.


Speaking as a contributor to html5ever: it isn't made to parse the same way Firefox does, just to parse HTML5 correctly. Hopefully the two are the same, but in practice some Firefox parser errors/behaviors won't be reproduced.



