I enjoy messing around with parsing things. Although I started with handmade, unschooled attempts many decades ago, I later went through the classic yacc/bison phase before firmly settling back into the hand-rolled custom side of things, where I'm much happier.
My main motivation is speed, e.g. I have enjoyed handcrafting wickedly fast custom JSON and SQL parsers and bits of code that sit on top of them.
My general approach now is to use a tokenizer that generates an int that represents each token, where the bits in the int tell me the location of the token in the source buffer and its type etc.
In languages with a lot of per-object overhead like Java this is literally a long; but in the C/C++/Rust camp it can look and feel like an object or struct or whatever because, underneath, it still ends up being an int that gets passed in registers and on the stack etc.
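For the curious, here's roughly what I mean as a small Rust sketch; the field widths and the Kind names are made up purely for illustration:

```rust
// A token packed into one u64: a small kind tag plus offset and length into
// the source buffer. The split here (8-bit kind, 32-bit offset, 24-bit length)
// is an arbitrary choice for the sketch, not a recommendation.
#[derive(Clone, Copy)]
struct Token(u64);

#[derive(Clone, Copy)]
enum Kind { Ident = 0, Number = 1, Punct = 2 }

impl Token {
    fn new(kind: Kind, offset: u32, len: u32) -> Token {
        Token(((kind as u64) << 56) | ((offset as u64) << 24) | (len as u64 & 0xFF_FFFF))
    }
    fn kind(self) -> u8 { (self.0 >> 56) as u8 }
    fn offset(self) -> usize { ((self.0 >> 24) & 0xFFFF_FFFF) as usize }
    fn len(self) -> usize { (self.0 & 0xFF_FFFF) as usize }
    // Borrow the token's text straight out of the source buffer: no allocation.
    fn text(self, src: &str) -> &str {
        &src[self.offset()..self.offset() + self.len()]
    }
}
```

Finding the next token is then just re-running the lexer from `offset() + len()`, which needs no allocation at all.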
Sometimes the parsing is one-pass and the tokens don't need to be stored at all; it's usually the memory allocations that kill parsing performance, and a once-through JSON decoder can completely eliminate bottlenecks on hot paths in data processing etc.
Other times I run through once and store these tokens in an array, particularly if I'm going to be going back over them. It's actually easy to make a function that, given an 'int' token, finds the next one, so if you are going through the data several times you don't need any allocations. But sometimes you want to go backwards, or you are really going to be traversing the data a lot, so it makes sense to store the offsets of everything.
Sometimes future steps will be reordering things and transforming the AST; in those cases, I generally have a writeable arena where I append the text of new tokens, and a bit in the token ints discriminates between the read-only source buffer and this transformed buffer. This is all particularly handy when it comes to generating sensible error messages with context, a finesse most handmade-parser authors later rue having overlooked :)
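Concretely, the two-buffer trick looks something like this sketch (the bit layout and names are mine, invented for illustration): the top bit says which buffer the span points into, and resolving a token's text just branches on it.

```rust
// One bit of the packed token records whether its span lives in the read-only
// source or in the append-only arena of rewritten text.
#[derive(Clone, Copy)]
struct Tok(u64);

const ARENA_BIT: u64 = 1 << 63;

struct Buffers<'a> {
    source: &'a str, // original, read-only input
    arena: String,   // text appended for rewritten/synthesized tokens
}

impl Tok {
    fn text<'a>(self, bufs: &'a Buffers<'_>) -> &'a str {
        let off = ((self.0 >> 24) & 0xFFFF_FFFF) as usize;
        let len = (self.0 & 0xFF_FFFF) as usize;
        let buf = if self.0 & ARENA_BIT != 0 { bufs.arena.as_str() } else { bufs.source };
        &buf[off..off + len]
    }
}
```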
I would be interested to know just how unmainstream this kind of approach is. Please weigh in; I'd love to learn new tricks :)
I am very interested in parsers and in optimizing them. Currently I have an HTML parser written in Go. I have used Go's html parser, but it is waaay slower than mine. Here's what I do:
- stream the content (either from a file or from network) and send chunks to the tokenizer
- tokenizer is dead simple with a state machine
- when an html tag is found, emit the token with positions (the problem is that someone needs to hold the whole incoming buffer; currently it copies the incoming bytes, but I plan to hold the bytes only until a token is found, and once the token is emitted with its buffer (copying), the inner buffer will be freed)
- parser does its thing and creates a DOM
as you can see I am not well versed in optimizing to the last bit but I am very interested in diving deep into the optimizations.
Currently I am looking at zero-copy networking. There is supposedly a library for Go (fasthttp) that would not copy bytes from the networking interface, but I haven't tried it yet.
What kind of algorithms would you recommend for tokenizing & parsing html/xml very efficiently?
There are so many 'it depends' here, based on your use-cases and perhaps details specific to go optimisation that I am unaware of.
The first step, tokenization, might well be very Go-specific around zero-copy networking and things, sorry. The general idea, though, would be to use a nice little finite-state machine to eat characters and recognise when tags open and close, the name of the tag, the attributes and values, the comments and the body etc. And if you can avoid allocating any kind of object on the heap at this stage, it'll almost always be a big win.
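For whatever it's worth, the shape I have in mind is roughly this (a toy sketch, in Rust rather than Go, that only separates text from <...> tags; real HTML needs more states for attributes, comments, scripts and so on):

```rust
// Toy byte-at-a-time state machine: it splits text runs from <...> tags and
// emits (start, end, is_tag) spans into the caller's buffer, allocating nothing.
#[derive(Clone, Copy, PartialEq)]
enum State { Text, Tag }

fn scan(input: &[u8], mut emit: impl FnMut(usize, usize, bool)) {
    let mut state = State::Text;
    let mut start = 0;
    for (i, &b) in input.iter().enumerate() {
        match (state, b) {
            (State::Text, b'<') => {
                if i > start { emit(start, i, false); } // text run
                start = i;
                state = State::Tag;
            }
            (State::Tag, b'>') => {
                emit(start, i + 1, true); // whole tag, brackets included
                start = i + 1;
                state = State::Text;
            }
            _ => {}
        }
    }
    if start < input.len() { emit(start, input.len(), state == State::Tag); }
}
```

Calling `scan(html, |s, e, is_tag| ...)` hands you byte ranges you can slice out of the original buffer without copying.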
But then you need to create a DOM, and you're going to need memory allocation for that.
But the arena approach in the article is good for this; use a big byte array as an arena, and write a cleaned-up, normalized copy of the parsed html into it, with length prefixes and distance-to-next-tag etc baked in. Typically a DOM has nodes with parent, previous-sibling and next-sibling pointers. You'll probably still want these, but they can be offsets written into the byte buffer in between fragments of the parsed html, rather than maintained as a separate structure.
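As a sketch of that layout (field order and sizes invented here, not taken from the article): nodes are appended to one byte buffer, and the parent/sibling links are u32 offsets into that same buffer.

```rust
// Nodes live inside one Vec<u8> arena; links are u32 offsets from the start
// of the arena, with 0 meaning "no link" (offset 0 is reserved by the header).
struct Arena { buf: Vec<u8> }

#[derive(Clone, Copy)]
struct NodeRef(u32); // offset of a node header within the arena

impl Arena {
    fn new() -> Arena {
        Arena { buf: vec![0, 0, 0, 0] } // reserve offset 0 so it can mean "none"
    }

    fn push_node(&mut self, parent: NodeRef, name: &[u8]) -> NodeRef {
        let at = self.buf.len() as u32;
        self.buf.extend_from_slice(&parent.0.to_le_bytes());            // parent offset
        self.buf.extend_from_slice(&0u32.to_le_bytes());                // next-sibling offset, patched later
        self.buf.extend_from_slice(&(name.len() as u32).to_le_bytes()); // length prefix
        self.buf.extend_from_slice(name);                               // tag name bytes
        NodeRef(at)
    }

    fn name(&self, node: NodeRef) -> &[u8] {
        let at = node.0 as usize;
        let len = u32::from_le_bytes(self.buf[at + 8..at + 12].try_into().unwrap()) as usize;
        &self.buf[at + 12..at + 12 + len]
    }

    fn set_next_sibling(&mut self, node: NodeRef, next: NodeRef) {
        let at = node.0 as usize + 4;
        self.buf[at..at + 4].copy_from_slice(&next.0.to_le_bytes());
    }
}
```

Navigation is then just arithmetic on offsets, and parsing is effectively "serialize the DOM as you go".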
Then, if the user of your DOM can modify it, you can have a node navigation API that seamlessly stores rewritten DOM parts in a new byte arena or appended on the end of the original arena.
It may seem unexpected given all the hype around Go, but it is a surprisingly poor choice for this. If you want a more convenient language than C++ or Rust but retain the ability to reach optimal performance, C# could serve you much better.
A quick google finds https://github.com/ffenix113/fastxml which seems to be doing the 'tips and tricks' of arena parsing and things. Any idea how fast it compares when you get away from memory allocations and things and end up just seeing how the compiler does basic byte manipulation?
The description indicates it is not production-ready, and the repo is archived as well.
If you pull out all the stops in each respective language, C# will always end up winning at parsing text, as it offers C structs, pointers, zero-cost interop, Rust-style struct generics, a cross-platform SIMD API, and simply has a better compiler. You can win back some performance in Go by writing the hot parts in Go's ASM dialect, at much greater effort and for a specific platform.
For example, Go has to resort to this https://github.com/golang/go/blob/4ed358b57efdad9ed710be7f4f... in order to efficiently scan memory, while in C# you write the following once and it compiles to all supported ISAs with their respective SIMD instructions for a given vector width: https://github.com/dotnet/runtime/blob/56e67a7aacb8a644cc6b8... (there is a lot of code because C# covers a much wider range of scenarios and does not accept sacrificing performance on odd lengths and edge cases, which Go does).
There is a lot more of this. Performance, and the low-level primitives to achieve it, have been an area of focus for .NET for a long time, so it is disheartening to see one tenth of the effort in Go receive so much more of the spotlight.
probably worth noting that the default peg backend for the c combinator parser 'hammer' uses an arena allocator for speed and because it's a good match for packrat memoization. there are hammer bindings for several languages including python and java, not sure if there's a rust one yet. hammer also has some other backends including lalr, and in some cases you can write a grammar such that you can parse it with either packrat or lalr
hammer's arena allocator is not very fast, though. it's faster than any malloc i've seen, but a good arena allocator is faster than calling an empty function, so you have to inline the fast path. the open-source prototype arena allocator i wrote in http://canonical.org/~kragen/sw/dev3/kmregion.h (and .c) doesn't quite reach that level, but it takes about 1.9 nanoseconds per allocation (520 million allocations per second on one core), while glibc's malloc takes about 19 nanoseconds per allocation. more recent versions of glibc (on a faster cpu) have reduced that to 14 nanoseconds in that particular microbenchmark. details are at the link. your mileage may vary
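the fast path i mean is basically just a pointer bump plus a bounds check, e.g. (a rust sketch, not kmregion itself; a real arena hands out raw pointers and grows by chunks):

```rust
// sketch of the inlined fast path: the hot case is an add, a mask, and a
// compare, so it has to be visible to the inliner to beat an empty call.
pub struct Bump {
    chunk: Vec<u8>,
    used: usize,
}

impl Bump {
    pub fn new(cap: usize) -> Bump {
        Bump { chunk: vec![0; cap], used: 0 }
    }

    #[inline(always)]
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        // round up to alignment (align must be a power of two)
        let start = (self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.chunk.len() {
            return None; // slow path: grab a new chunk, not shown here
        }
        self.used = end;
        Some(&mut self.chunk[start..end])
    }
}
```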
for those who aren't aware, garbage collectors for languages like js, c#, ocaml, or java are usually generational copying collectors for efficiency, which means they use what is effectively an arena allocator for most new allocations. often they reserve a cpu register or two for this, which my c implementation above can't
for c, the apache portable runtime 'apr' has a generic arena allocator in it called pools, which also supports destructors (the drop trait, in rust) and nested pools
the arena allocator in gcc is called 'obstacks' and got moved into libiberty so you can use it in other c programs. or hypothetically in rust
Hammer's arena allocator is, in fact, hot garbage. The entire reason it's there in the first place is that I didn't want to get into the weeds integrating a GC when building a proof of concept that PEGs could work at runtime in C with replaceable backends, so I didn't bother tuning it much beyond "4k sounds like a good number".
To me, the valuable part of hammer was the idea that you could use a friendly PEG-like API and end up with an LALR or GLR parser, and that, I believe, was a resounding success.
On the other hand, the generated AST turned out to be even less pleasant to work with than I imagined, and the only part of the implementation that's worth salvaging is the bit reader and writer routines.
haha! well, you're being harsher on your own work than i think it deserves
hammer works pretty well, i think, even the ast generation part. judicious application of semantic actions can give you whatever ast you want, if you're willing to stick to the peg backend anyway
i'm curious what you think about the prototype arena allocator i linked above; i wrote it originally as a candidate replacement for hammer's
> To release memory we simply release the blob itself, so it's like a batched deallocation, sort of. The consequence here is that all of our types must be:
> 1. trivially copyable (: Copy if we speak in Rust)
> 2. trivially destructible (i.e. they can't have impl Drop)
2 (Drop) seems obvious enough, but why 1 (Copy)? There’s a bit of Rust knowledge implicit there that I’m unaware of, and I might not be alone in that. Could someone explain?
I think it's a mistake. Allocator Implementation[1] shows values being moved into the arena, which can be done without Copy or even Clone. Maybe the author got confused, since moving an object in Rust just copies the bytes and invalidates the moved from binding.
I'm pretty sure being trivially destructible is sufficient for this use case. The only reason you might want a Copy bound when you're not actually copying values is if you're trying to disallow interior mutability, but that doesn't seem to be the case here. Perhaps the author was misinformed.
Or maybe they're just using Copy as a type-level bound that guarantees a trivial destructor, even if it's more restrictive than strictly necessary.
You can't have a negative trait bound to say "I take anything that _doesn't_ implement Drop". It's not a stabilized feature in Rust because it would allow adding a trait to a struct to become a breaking API change where it currently is not. Copy guarantees !Drop so it's currently the only way to do it without opting into nightly features.
You can remove the requirement for trivially destructible types by maintaining a singly linked list of (ptr to alloc'd object, destructorFn) pairs with the links allocated from the arena itself. The arena can just walk the list and call all the drop functions when it gets reset. You can even specialize for types with trivial destructors so they don't append to the list and so you only pay the extra cost if you allocate Drop types (using std::mem::needs_drop).
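A rough Rust sketch of that scheme (one fixed-size chunk and minimal error handling, purely to show the shape; all names are made up):

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::mem::{align_of, needs_drop, size_of};
use std::ptr;

// Objects are bump-allocated out of one raw chunk; only types that actually
// need dropping push a (pointer, drop fn) pair onto a list whose links are
// themselves allocated from the arena. Reset walks the list, then rewinds.
struct DropEntry {
    object: *mut (),
    drop_fn: unsafe fn(*mut ()),
    next: *mut DropEntry,
}

struct Arena {
    base: *mut u8,
    cap: usize,
    used: usize,
    drops: *mut DropEntry,
}

impl Arena {
    fn with_capacity(cap: usize) -> Arena {
        let base = unsafe { alloc(Layout::from_size_align(cap, 16).unwrap()) };
        assert!(!base.is_null());
        Arena { base, cap, used: 0, drops: ptr::null_mut() }
    }

    fn alloc_raw(&mut self, size: usize, align: usize) -> *mut u8 {
        assert!(align <= 16, "sketch only aligns the chunk to 16");
        let start = (self.used + align - 1) & !(align - 1);
        assert!(start + size <= self.cap, "arena full (sketch has one fixed chunk)");
        self.used = start + size;
        unsafe { self.base.add(start) }
    }

    fn alloc<T>(&mut self, value: T) -> &mut T {
        let p = self.alloc_raw(size_of::<T>(), align_of::<T>()) as *mut T;
        unsafe { p.write(value) };
        // Only pay for the drop list when T really has a destructor.
        if needs_drop::<T>() {
            unsafe fn drop_erased<T>(obj: *mut ()) { unsafe { ptr::drop_in_place(obj as *mut T) } }
            let entry = self.alloc_raw(size_of::<DropEntry>(), align_of::<DropEntry>()) as *mut DropEntry;
            unsafe {
                entry.write(DropEntry { object: p as *mut (), drop_fn: drop_erased::<T>, next: self.drops });
            }
            self.drops = entry;
        }
        unsafe { &mut *p }
    }

    // Run pending destructors, then rewind the bump pointer.
    fn reset(&mut self) {
        let mut cur = self.drops;
        while !cur.is_null() {
            unsafe {
                ((*cur).drop_fn)((*cur).object);
                cur = (*cur).next;
            }
        }
        self.drops = ptr::null_mut();
        self.used = 0;
    }
}

impl Drop for Arena {
    fn drop(&mut self) {
        self.reset();
        unsafe { dealloc(self.base, Layout::from_size_align(self.cap, 16).unwrap()) };
    }
}
```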
> You can't have a negative trait bound to say "I take anything that _doesn't_ implement Drop". It's not a stabilized feature in Rust because it would allow adding a trait to a struct to become a breaking API change where it currently is not. Copy guarantees !Drop so it's currently the only way to do it without opting into nightly features.
Alternatively, you could just do an std::mem::needs_drop() check when inserting an element, and fail if it indicates that it would need to be dropped. (Yeah, its documentation protests that it can have false positives, but IIRC you have to be doing super wacky type-level stuff to trigger that for real.)
That's why I said that Copy is sufficient (but restrictive) as a type-level bound, since needs_drop() can't be done on that level. At least, not without the trick of throwing a post-mono error with an associated const.
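For reference, that trick looks roughly like this (names invented; the point is that the associated const is evaluated per instantiation, so passing a Drop type fails at compile time rather than at runtime):

```rust
use std::marker::PhantomData;
use std::mem::needs_drop;

// Post-monomorphization error: the associated const is evaluated when the
// function is instantiated for a concrete T, so a type that needs dropping
// becomes a compile-time error instead of a runtime check.
struct AssertNoDrop<T>(PhantomData<T>);

impl<T> AssertNoDrop<T> {
    const OK: () = assert!(!needs_drop::<T>(), "this arena only accepts trivially destructible types");
}

fn arena_alloc<T>(value: T) {
    let () = AssertNoDrop::<T>::OK; // forces the check for this T
    // ... bump-allocate `value` into the arena here ...
    std::mem::forget(value); // placeholder; forgetting a no-Drop type is a no-op
}
```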
Trivial destructibility alone is generally sufficient for that in Rust. Recall that all values (outside a Pin) are trivially moveable, so pretty much the only reason you'd want trivial copying is if you actually need more than one valid copy at once. (One exception is using Copy to express a poor man's Freeze bound, which is quite limited, and dubious in any case.)
A type that is not Copy may have associated memory allocations, however. You cannot simply forget a range of !Copy structs without the risk of memory leaks.
If those !Copy structs deallocate their memory when dropped properly, then they aren't trivially destructible. If they were trivially destructible, then they would leak their memory regardless of whether you dropped or forgot them. That's what it means for a type to be trivially destructible: dropping it is a no-op.
I think we're talking about the same thing -- the original post in the thread used "no impl Drop" to mean "trivially destructible", not "no functional difference between dropping and forgetting".
Well, I interpret "trivially destructible (i.e. they can't have impl Drop)" as meaning "the type itself can't impl Drop, nor can any of its fields or subfields", which is what it means for a type to be "trivially destructible" in Rust.
But what do you mean by "trivially destructible" being different from "no functional difference between drop and forgetting"? Unless I'm mistaken, they are the same thing: if none of a type's subfields impl Drop, then dropping it does absolutely nothing; whereas if any subfield does impl Drop, then it can't be a no-op, since the compiler has to form a &mut reference to pass it into Drop::drop() (a distinction very relevant for unsafe code).
Could you give me an example of a !Copy type that is "trivially destructible", but can't be forgotten without leaking memory?
This. Writing a parser in Zig is so simple. Just allocate once, and then start writing your parser.
One allocator for parser, one for scanner. One for type allocation. Keep them all for semantic analysis. Write the output renderer (binary/language). Deallocate.
In this whole process, it makes it so easy to not think about memory anymore. Just enjoy writing your program.
The scanner can probably be zero-allocation though - Zig should be able to express an iterator over the input byte array that returns a token on dereference without needing more state than the position in the array.
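e.g., translated into Rust terms for concreteness, the whole scanner state can be a position, with each token just a span into the caller's bytes (a toy whitespace splitter, purely for illustration):

```rust
// A scanner as a plain iterator: the only state is the position, and each
// token is just a (start, end) span into the caller's byte slice.
struct Scanner<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Iterator for Scanner<'a> {
    type Item = (usize, usize); // token span: [start, end)

    fn next(&mut self) -> Option<(usize, usize)> {
        // skip whitespace
        while self.pos < self.input.len() && self.input[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        if self.pos >= self.input.len() {
            return None;
        }
        let start = self.pos;
        while self.pos < self.input.len() && !self.input[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        Some((start, self.pos))
    }
}
```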
Parsing is fun, but assuming a good baseline allocator, arena allocation doesn't get you a whole lot these days (maybe it's better in a GC environment, where teardown does not require a tree traversal?), especially if you're able to lazily build the AST (this is what I did in JSC years ago).
I still feel that there's a lot of practical parsing stuff that isn't considered in academia, which leads to a near-constant preference for LALR and LR parser generators, despite them being honestly worse than recursive descent on almost every important metric if you want to do anything at all other than produce a completely generic AST for known-good data.
This is a good idea. The tree constructed by a parser is relatively likely to want similar lifetimes for most of the nodes.
If the nodes are mutable you can link them together as you go, meaning some amount of overhead in the structure for links that aren't in use, but no garbage left behind.
Storing pointers as distance from the start of a contiguous arena means a structure you can mmap later without relocating.
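i.e. something like this (layout invented for illustration): no field holds an absolute address, so the arena bytes can be written out and mapped back in at any base address with no fixups.

```rust
// All links are offsets from the start of the arena, so the serialized bytes
// contain no absolute pointers and survive being mmapped at a different base.
#[repr(C)]
#[derive(Clone, Copy)]
struct Node {
    parent: u32,       // arena offset of parent, u32::MAX for none
    prev_sibling: u32, // arena offset, u32::MAX for none
    next_sibling: u32, // arena offset, u32::MAX for none
    text_start: u32,   // span of this node's text within the arena
    text_len: u32,
}

fn node_at(arena: &[u8], offset: u32) -> Node {
    // Read a node header out of the (possibly mmapped) arena bytes.
    let at = offset as usize;
    let read = |i: usize| u32::from_le_bytes(arena[at + i..at + i + 4].try_into().unwrap());
    Node {
        parent: read(0),
        prev_sibling: read(4),
        next_sibling: read(8),
        text_start: read(12),
        text_len: read(16),
    }
}
```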
Oh yeah, arena allocation is great. I've used it a few times and it's always a huge performance improvement. The requirements are that (i) the objects you're allocating all have the same type, and (ii) they all have the same lifetime (i.e. it's okay not to free any of them while others still live).
It wouldn't. For some reason in Rust people tend to apply the term "arena" to "vector + indices" as well as the proper definition, which does have that restriction.
your (i) is incorrect (arena allocators can allocate objects of different types, and indeed one of their strong points is that they don't suffer from fragmentation) and your (ii) doesn't make sense because not freeing an object is always okay, according to conventional programming language semantics anyway
this particular arena allocator does allocate objects of different types, almost arbitrary types in fact:
the safety concern is that nothing outside the arena survives the arena if it contains a pointer into the arena. as the linked micro-book puts it, 'it shouldn't be possible for any arena-allocated [variable] to outlive the memory that it[s data]'s located on.'
> not freeing an object is always okay, according to conventional programming language semantics anyway
Conventional semantics have to hit reality eventually. Long-running programs can't just alloc forever without free, hence an arena not being the only allocator in most long-running programs.
yes, this is certainly a weak point in conventional semantics, but not relevant to this issue, which is about ensuring no dangling pointers, not about memory leaks
if you're running low on memory, switching to an arena allocator is probably a bad idea, because it will usually increase your memory requirements (by delaying deallocation longer), not decrease them