Hacker News new | past | comments | ask | show | jobs | submit login
Parsing bitstreams with Nom (adamchalmers.com)
105 points by adamch on March 3, 2022 | hide | past | favorite | 34 comments



This is very clever, and a great demonstration of how flexible and generic Nom is!

It's not quite as inspired, but I wrote a minimal bitstream cursor/parser from scratch in Rust a few months ago[1]. It's developed with LLVM bitstreams in mind, but the API itself is suitable for just about any bitstream.

[1]: https://docs.rs/llvm-bitcursor/latest/llvm_bitcursor/index.h...


I'm collecting a bit of a benchmark for rust bitstream libraries inside my own library[1], and I wasn't aware of yours, so I'll add it to the list this afternoon!

[1]: https://github.com/nickbabcock/bitter


Awesome! I'll be interested to see how my (extremely naive) implementation stacks up.


Updated the benchmark with your implementation! You can see the results here: https://github.com/nickbabcock/bitter#comparison-to-other-li...


I have started fiddling with rust, and one thing that I miss from the Elixir world is binary pattern parsing/matching. The elixir model is so much nicer, I've used byteorder to convert a datetime to a byte array and then the reverse and it is not the most elegant code, nom looks much nicer. I wish rust had Elixir/Erlang like binary pattern matching.


Agreed this would be a nice addition.

https://docs.rs/bitmatch/latest/bitmatch/ works nicely but only for a single integer.

https://internals.rust-lang.org/t/pre-rfc-binary-patterns/89... didn't get much discussion but seems the closest to Erlang.


Nom is a fantastic library. I've never implemented a parser before and I found it relatively easy to pick up. Some of that is due to parser combinators in general, no doubt, but it was very satisfying to be able to compose a parser with nom.


I completely agree. I wrote a parser for my 6502 assembler and nom together with nom-locate allowed me to write a parser that could recover from errors, parse trivia, return detailed error messages with locations and make everything easily unit testable. The main drawback was that because it's a strongly typed parser combinator you can end up with some pretty inscrutable errors when you combine the parsers incorrectly.


I agree, it's surprisingly usable given how complicated parsing can be! I think the key to its ease of use, at least for me, is that you can unit test the individual parsers. You can program very iteratively, building it piece by piece and testing as you go.

This is much easier than trying to design the entire parser on a whiteboard/paper and implement it in one big go, I think.


I can only second that, and it's only getting better over time. It used to be quite fiddly if you wanted to work on the String rather than the byte level, but that has improved a lot.

I've written 4(?) different parsers with different major versions of nom, and each has been easier than the previous one.

The main thing I'd like to see improved related to it would be `cookie-factory`, its basically symetric serialization library.


If you want some practice, you could try solving Advent of Code 2021 day 16: https://adventofcode.com/2021/day/16


Was reading this with an odd sense of deja vu.

Then it hit me: If we ignore the odd choice of 3 & 5 bit fields, this is very similar to ASN.1


That is exactly why I ended up learning bitwise Nom, and the inspiration for this blog post :)


Nom is a fantastic library. I have built a SIP library [1] on top of Nom, no way I would have built that without Nom's help, and even if I did, it would be a heck of a mess and under-optimized code.

[1]: https://github.com/vasilakisfil/rsip


I have to note ocaml-bitmatch/bitstring. Example code: https://bitstring.software/documentation/#matching-bitstring...

This was extremely convenient for me in the past.


In the past (also Advent of Code 2021), I found the bitvec crate’s BitSlice pretty convenient for parsing bitstreams.


Just to echo this, I’m currently using a crate called deku (for a toy project to learn rust). Deku is a wrapper on top of bitvec which lets you annotate structs and enums for decoding and encoding. The speed I was able to implement a binary TCPStream encoder/decoder was great. Both have been rock solid for me.

Highly recommend both libraries!


Oh, interesting. I've used Bitvec for serializing binary data, and I can see how you can use it for parsing single bits or u8s... do you have an example of how to parse, say, a 3-bit binary number with Bitvec?


Something like this:

  use bitvec::prelude::*;  // BitView trait, Msb0, etc

  let my_input = vec![0b11100000u8];
  let bitslice = my_input.view_bits::<Msb0>();
  
  let my_num = bitslice[..3].load_be::<u8>();

  // continue parsing after my_num
  let nextslice = &bitslice[3..];
You would use Lsb0 instead of Msb0 if you wanted to parse bytes starting at the low bit.

It would be more convenient with a peek/take style wrapper over the slice, but I don't think one is included in the crate.


That's really neat, thanks for showing me. This looks very straightforward for parsing a straightforward format.

I think Nom might be more useful as the format gets more complicated... the combinators can really save you quite a bit of time. Having things like many0, or delimited_pair, or length_count already tested and documented saves time compared to implementing + testing it yourself. It's cool seeing how many different approaches you can use to solve the same problem :)


No problem!

> I think Nom might be more useful as the format gets more complicated... the combinators can really save you quite a bit of time. Having things like many0, or delimited_pair, or length_count already tested and documented saves time compared to implementing + testing it yourself.

Totally agree -- bitvec works for things that are easy to hand write a parser for. Nom is a lot more powerful.


If you like bitvec, check out [deku](https://github.com/sharksforarms/deku)


Thanks! Does deku work with records that are not byte-aligned? I checked out the examples and docs and all looked byte-aligned.


When I see rust code I have the feeling that as a developer you are in a constant battle with the compiler to fulfill its type system.

I wonder if some unit test / fuzz tests in a less strict language could lead to the same amount of code quality yet at a higher level of work satisfaction while writing the code.


In my experience, types which prove certain properties are usually a lot simpler than the tests verifying the same (at least tests which actually have a good chance of catching transgressions, instead of just checking a single hard-coded input). So more expressive type systems bring me a higher level of satisfaction, because they lead to simpler code than the code I'd need to write if I needed to test the same things. They also provide valuable documentation. On the other end of the spectrum, when I look at untyped Python code, I feel completely lost and find myself scouring the documentation for information that should be obvious from function definitions, but is instead specified in prose 3 links away from documentation of the function I was interested in.


Types guarantee properties over the full set of possible inputs before you run any code. If you can enforce a property with a type, it is IME far less work than trying to enforce that property with a test.

"Less strict" inevitably means not caring about a bunch of edge cases. It's a bit like "overflow is undefined behavior in C"; nobody wants to care about how "i++" is a statement that's (a) absolutely everywhere and (b) potentially invokes undefined behavior at runtime, so the only way to guarantee that out involves a lot more proof work somewhere.


> If you can enforce a property with a type, it is IME far less work than trying to enforce that property with a test.

That's an unnecessarily broad generalization to a concrete statement, and if it were true more people would write their programs in dependently typed programming languages or whatever.

> nobody wants to care about how "i++" is a statement that's (a) absolutely everywhere and (b) potentially invokes undefined behavior at runtime, so the only way to guarantee that out involves a lot more proof work somewhere.

In many domains, the high standards you're displaying here will almost immediately throw you out of business. In particular "i++" for most applications is just fine to write without any second thoughts. It's not even a good idea to write a test. Just run the program and fix it in the rare case something it doesn't work as expected (it's almost a safe bet the issue won't UB).

You're not taking a tank to go to the supermarket either, are you?


> more people would write their programs in dependently typed programming languages or whatever.

Programming language choice is greatly influenced by ecosystem factors. If you wanted to write a program for the browser, you had to write it in Javascript - so people went to considerable lengths to fit typed front ends onto Javascript.

> Just run the program and fix it in the rare case something it doesn't work as expected (it's almost a safe bet the issue won't UB).

> You're not taking a tank to go to the supermarket either, are you?

No, but if you're going into a live fire hostile environment you might want a tank, and the internet is such an environment. If there is a bug in your parser, and the parser is exposed to untrusted data, then that's a potential avenue for exploit.

Less of an issue in managed-runtime languages, but it's still possible to have semantic level exploits. I remember there was one for Apple which exploited the fact that there were different proplist parsers. https://daringfireball.net/linked/2020/05/02/psychic-paper


Preamble : not saying you're wrong.

The idea of fighting ("[being] in a constant battle with") the compiler seems weird to me, the compiler is my automated helper, the stricter it is, the better I work and I thank it for correcting me.


ERROR: There is a problem with your comment. Operator`,`: no overload matching `(english::main-clause, english::main-clause)` was found.

No need to thank me.


Not an ERROR, while not main clauses, they are clauses not essential to the meaning of the sentence. See #4 here:

https://www.iue.edu/student-success/coursework/commas.html


Does anyone have a good "how to parser combinator" introduction? I've been looking at C# to dismantle various binary formats and this may be a good approach.


My previous blog post about Nom tries to do that: https://blog.adamchalmers.com/nom-chars/





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: