There are so many 'it depends' here, based on your use-cases and perhaps on details of Go optimisation that I'm unaware of.

The first step, tokenisation, might well be very Go-specific around zero-copy networking and the like, sorry. The general idea, though, is to use a nice little finite-state machine that eats characters and recognises when tags open and close, the tag names, the attributes and values, the comments, the body, etc. If you can avoid allocating any kind of object on the heap at this stage, it's almost always a big win.
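A minimal sketch of what that FSM might look like in Go, assuming the input already sits in one byte slice. The Tokenizer/Next names are made up, and a real HTML tokenizer needs many more states (comments, doctypes, entities, raw-text elements like script); the point is just that every token is a sub-slice of the original buffer, so the scan itself allocates nothing:

    package main

    import "fmt"

    // tokenizer states
    type state int

    const (
        inText state = iota
        inTag
    )

    // Tokenizer walks the input buffer in place; every token it
    // returns is a sub-slice of buf, so scanning does no heap
    // allocation.
    type Tokenizer struct {
        buf []byte
        pos int
    }

    // Next returns the next token and whether it is a tag ("<...>")
    // or a text run. It returns nil at end of input.
    func (t *Tokenizer) Next() ([]byte, bool) {
        start := t.pos
        st := inText
        for t.pos < len(t.buf) {
            c := t.buf[t.pos]
            switch st {
            case inText:
                if c == '<' {
                    if t.pos > start {
                        // flush the accumulated text run first
                        return t.buf[start:t.pos], false
                    }
                    st = inTag
                }
            case inTag:
                if c == '>' {
                    t.pos++
                    return t.buf[start:t.pos], true
                }
            }
            t.pos++
        }
        if t.pos > start {
            return t.buf[start:t.pos], false
        }
        return nil, false
    }

    func main() {
        tz := &Tokenizer{buf: []byte(`<p class="x">hello</p>`)}
        for tok, isTag := tz.Next(); tok != nil; tok, isTag = tz.Next() {
            fmt.Printf("isTag=%v token=%q\n", isTag, tok)
        }
    }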

But then you need to create a DOM, and you're going to need memory allocation for that.

But the arena approach in the article is good for this: use a big byte array as an arena, and write a cleaned-up, normalised copy of the parsed HTML into it, with length prefixes, distance-to-next-tag, etc. baked in. Typically a DOM has nodes with parent, previous-sibling, and next-sibling pointers. You'll probably still want these, but they can be offsets written into the byte buffer in between fragments of the parsed HTML, rather than maintained as a separate structure.
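A rough sketch of one possible record layout, with made-up names and a hypothetical fixed 16-byte header (parent, next-sibling, prev-sibling, payload length as little-endian uint32 offsets), using offset 0 to mean "no node":

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // Each node record in the arena:
    //   [0:4)   parent offset        (0 = none)
    //   [4:8)   next-sibling offset  (0 = none)
    //   [8:12)  prev-sibling offset  (0 = none)
    //   [12:16) payload length
    //   [16:16+len) normalised HTML fragment bytes
    const headerSize = 16

    type Arena struct{ buf []byte }

    // AddNode appends a node record and returns its arena offset.
    func (a *Arena) AddNode(parent, prev uint32, payload []byte) uint32 {
        off := uint32(len(a.buf))
        var hdr [headerSize]byte
        binary.LittleEndian.PutUint32(hdr[0:], parent)
        binary.LittleEndian.PutUint32(hdr[8:], prev)
        binary.LittleEndian.PutUint32(hdr[12:], uint32(len(payload)))
        a.buf = append(a.buf, hdr[:]...)
        a.buf = append(a.buf, payload...)
        if prev != 0 { // link the previous sibling forward to us
            binary.LittleEndian.PutUint32(a.buf[prev+4:], off)
        }
        return off
    }

    // Payload returns the fragment bytes at off, zero-copy.
    func (a *Arena) Payload(off uint32) []byte {
        n := binary.LittleEndian.Uint32(a.buf[off+12:])
        return a.buf[off+headerSize : off+headerSize+n]
    }

    // Next returns the next-sibling offset of the node at off.
    func (a *Arena) Next(off uint32) uint32 {
        return binary.LittleEndian.Uint32(a.buf[off+4:])
    }

    func main() {
        var a Arena
        a.buf = append(a.buf, 0) // reserve offset 0 as "no node"
        root := a.AddNode(0, 0, []byte("<body>"))
        c1 := a.AddNode(root, 0, []byte("<p>hi</p>"))
        a.AddNode(root, c1, []byte("<p>bye</p>"))
        for off := c1; off != 0; off = a.Next(off) {
            fmt.Printf("node@%d: %s\n", off, a.Payload(off))
        }
    }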

Then, if the user of your DOM can modify it, you can have a node navigation API that seamlessly stores rewritten DOM parts in a new byte arena or appends them to the end of the original arena.
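Continuing the hypothetical Arena sketch above, one way to do that is copy-on-write: append the rewritten fragment as a fresh record at the end of the arena and repoint the sibling links, leaving the old record as dead space for a later compaction pass (or a rebuild into a fresh arena). ReplaceNode here is made up, not from any real library:

    // ReplaceNode appends a rewritten copy of the node at old and
    // splices it into old's place in the sibling chain. If old was
    // the first child (prev == 0), the caller must repoint whatever
    // references the chain head.
    func (a *Arena) ReplaceNode(old, parent, prev uint32, payload []byte) uint32 {
        // AddNode already patches prev's next-sibling link to point
        // at the replacement when prev != 0.
        repl := a.AddNode(parent, prev, payload)
        // Carry the old node's next-sibling link over to the copy;
        // the old record becomes unreachable dead space.
        binary.LittleEndian.PutUint32(a.buf[repl+4:], a.Next(old))
        return repl
    }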



