Parsing Text with Nom (adamchalmers.com)
52 points by adamch on Jan 12, 2022 | 19 comments



I wanted to parse log files with Rust and found Nom, but before that I found someone else's attempt at it:

https://www.cloudcity.io/blog/2018/11/08/parsing-logs-230x-f...

In the end I didn't use Nom for the time being, because as the author said:

> While wondering if there was a way to make it faster, I started re-reading the nom docs carefully, and that’s when I noticed that “sometimes, nom can be almost as fast as regex”. Feeling pretty silly, I went and rewrote my rust program to use the regex crate, and sure enough it got 3x faster.

The Nom parser library is slower than the regex crate? In simple cases it might not be worth it, unfortunately.


Interesting! The quote "sometimes, nom can be almost as fast as regex" has been removed from the Nom docs, so it's probably become faster since 2018 when that Cloudcity article was written. In fact, Nom now claims to "outperform many parser combinators library like Parsec and attoparsec, some regular expression engines and even handwritten C parsers" [1].

Someone used Nom for Advent of Code last year and found "The regex approach benchmarked at about 1ms while the parser approach benchmarked at 145 nanoseconds." [2] Maybe I'll try benchmarking Nom vs. regex for a follow-up post.

I find regexes easier than parser combinators for simple tasks. If the problem is small, parser combinators are overkill. But if the problem gets complicated, I think parser combinators are more readable: you can break a parser combinator into small, well-documented, well-tested parts more easily than a big regex. Still, I use regex more in my work, because most of my complex parsing I just hand off to serde.

[1] https://github.com/Geal/nom

[2] https://www.christopherbiscardi.com/advent-of-code-2020-in-r...
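
As an illustration of what I mean by breaking a parser into small parts, here is a minimal sketch (the log format, names, and regex are invented for the example). Instead of one regex like ^(\d{4}-\d{2}-\d{2}) ([A-Z]+) (.*)$, each piece gets its own small, testable function (assuming nom 7):

    use nom::bytes::complete::take_while1;
    use nom::character::complete::{char, digit1, not_line_ending, space1};
    use nom::combinator::recognize;
    use nom::sequence::tuple;
    use nom::IResult;

    // Matches a date like "2022-01-12".
    fn date(input: &str) -> IResult<&str, &str> {
        recognize(tuple((digit1, char('-'), digit1, char('-'), digit1)))(input)
    }

    // Matches an uppercase log level like "ERROR".
    fn level(input: &str) -> IResult<&str, &str> {
        take_while1(|c: char| c.is_ascii_uppercase())(input)
    }

    // "2022-01-12 ERROR disk full" -> ("2022-01-12", "ERROR", "disk full")
    fn log_line(input: &str) -> IResult<&str, (&str, &str, &str)> {
        let (rest, (d, _, lvl, _, msg)) =
            tuple((date, space1, level, space1, not_line_ending))(input)?;
        Ok((rest, (d, lvl, msg)))
    }

Each of date and level can be documented and unit-tested on its own, which is the readability win once the format grows.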


They must have a combinator for regex so you can escape to it if needed.


You can use lazy_static to have a global Regex and call it from a Nom function. No special support from Nom is required.
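
A minimal sketch of that (names invented; assumes the nom, regex, and lazy_static crates):

    use lazy_static::lazy_static;
    use nom::error::{Error, ErrorKind};
    use nom::IResult;
    use regex::Regex;

    lazy_static! {
        // Compiled once, reused on every call.
        static ref HEX_COLOR: Regex = Regex::new(r"^#[0-9a-fA-F]{6}").unwrap();
    }

    // A nom-compatible parser that delegates the actual matching to the global Regex.
    fn hex_color(input: &str) -> IResult<&str, &str> {
        match HEX_COLOR.find(input) {
            Some(m) => Ok((&input[m.end()..], m.as_str())),
            None => Err(nom::Err::Error(Error::new(input, ErrorKind::Tag))),
        }
    }

Because it returns an IResult, hex_color composes with other Nom combinators like any hand-written parser.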

Alternatively, one could build a parser combinator library that compiles to one large Regex. That Regex would return one big Matcher, so the library would also need to generate a convenience wrapper (from the same parser combinators) to make the match results usable.
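
A toy sketch of that second idea (not any particular library's API, just the shape of it): "combinators" that compose regex source strings, compiled once at the end. The capture-group bookkeeping needed for a usable Matcher wrapper is omitted here.

    // Toy "combinators" that build regex source text instead of running a parse.
    fn lit(s: &str) -> String {
        regex::escape(s)
    }

    fn seq(parts: &[String]) -> String {
        parts.concat()
    }

    fn alt(parts: &[String]) -> String {
        format!("(?:{})", parts.join("|"))
    }

    fn many0(part: &str) -> String {
        format!("(?:{})*", part)
    }

    fn main() {
        // (foo|bar) followed by zero or more digits, compiled once.
        let pattern = seq(&[alt(&[lit("foo"), lit("bar")]), many0("[0-9]")]);
        let re = regex::Regex::new(&format!("^{pattern}$")).unwrap();
        assert!(re.is_match("foo123"));
    }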


> Alternatively one can build a parser combinator library that compiles to a large Regex.

Maybe I'm misunderstanding what you meant, but I don't think you can do this in the general case, since parser combinators can describe languages that are more complicated than regular languages (i.e. not describable by a regex).


I wrote a crate to facilitate this:*

https://github.com/dfhoughton/pidgin

As it says there, you can only build non-recursive grammars this way.

And the reason I wrote that crate:

https://github.com/dfhoughton/two-timer

And the reason I wrote that crate:

https://github.com/dfhoughton/jobrog

And having written these crates, I went back to writing Ruby for my day job. I am not a very experienced rustacean, and what skill I developed writing these things has faded, but I use the last one daily, so the regex-based parser is still working pretty well.

* It's a "parser combinator library" inasmuch as it allows you to write reusable parsing rules that can be components of other rules.


I think you shouldn't need to escape to regex for performance. I don't understand why the regex crate would be so much faster in Rust. Some crazy optimizations that nom doesn't make?

I'm not a parser guru, but when I used megaparsec in a Haskell course, I never thought I should switch to regex for speed.


Yes, there are crazy optimizations in the regex crate. :-)

But it almost certainly depends on what you're doing. If you're using Nom, you're probably performing a parsing task. A regex might be faster there, but maybe not by too much, depending. If you're doing a searching task though, perhaps where there are few matches relative to the size of the haystack, then it's quite plausible that the regex will go a lot more than 3x as fast as Nom.

In any case, I'm not sure if Haskell is comparable. I'm not sure that any Haskell native regex engine is really known for its speed, although I haven't done any sort of comprehensive benchmarking.

There are other considerations. A parser written with Nom might be easier to read and/or manipulate than a parser written with regex. But even there, it depends.


Combinator alternatives, especially nested ones, will "naively" iterate over cases until a match is found.

In a regex, those (nested) alternatives may be expressed much more efficiently as a state machine.
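
For example, nom's alt tries each branch in sequence, rewinding to the start of the input whenever a branch fails, whereas a regex like interface|implements|import|if is compiled into a single automaton that scans the input once. A small sketch, assuming nom 7:

    use nom::branch::alt;
    use nom::bytes::complete::tag;
    use nom::IResult;

    // Branches are attempted in order; each failure backtracks to the start of `input`.
    fn keyword(input: &str) -> IResult<&str, &str> {
        alt((tag("interface"), tag("implements"), tag("import"), tag("if")))(input)
    }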


> "Parsing" is turning a stream of raw text or binary into some structured data types, i.e. a Rust type that your code can understand and use.

This is a similar take to https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va..., which goes into more detail on why "structured" is a requirement.

(Hi Adam!)


In JS/TS land, going from an unknown type to a known one normally means:

1. JSON.parse(...) the string to a JSON value

2. runtime-assert it, e.g. with [0], which is a form of parsing, but over a JSON value as opposed to a string

In other words, when dealing with JSON, it's better to work with combinators over JSON values than combinators over strings – the code is much more terse, natural and performant.

[0] https://github.com/appliedblockchain/assert-combinators


This can be seen as a form of tokenization, where the JSON objects are trees of tokens.

But that said, a more apt comparison is going from text to JSON in the first place (which is not trivial to do fast, and has issues when it comes to stream parsing).

There are enormous advantages to using JSON for storing structured application data though, mostly so you don't have to worry about writing a parser for it.


Hi Josh! "Parse, don't validate" is one of my favourite programming blog posts of all time. Between Nom and Serde, I find it very easy to use that approach in Rust.


As a beginner to both, when do you find yourself reaching for one instead of the other? Or do you use nom when deserializing with serde?


Serde is good for self-describing formats like JSON, YAML, TOML, etc. At least that's what I use it for. Nom isn't necessary for them!

Nom is good for protocols where you need to know the schema in advance, e.g. a binary key-value store where the bytes in the key and value have to be deserialised according to some special rules.
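
A minimal sketch of that kind of parser (the wire format here is made up): a length-prefixed key and value, each read straight into byte slices.

    use nom::bytes::complete::take;
    use nom::number::complete::be_u16;
    use nom::IResult;

    // Hypothetical format: [key len: u16 BE][key bytes][value len: u16 BE][value bytes].
    fn kv_entry(input: &[u8]) -> IResult<&[u8], (&[u8], &[u8])> {
        let (input, key_len) = be_u16(input)?;
        let (input, key) = take(key_len)(input)?;
        let (input, val_len) = be_u16(input)?;
        let (input, value) = take(val_len)(input)?;
        Ok((input, (key, value)))
    }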

Nom can also handle bit-level parsing very easily. It's easier to use the parsers in nom::bits than to remember all the bit-shifting tricks I haven't used in five years.
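
For example, splitting a header byte into a 4-bit version and a 4-bit flags field looks roughly like this (a sketch, following the shape of nom 7's bits module):

    use nom::bits::{bits, complete::take};
    use nom::error::Error;
    use nom::sequence::tuple;
    use nom::IResult;

    // Reads one byte as two 4-bit fields, with no manual shifting or masking.
    fn version_and_flags(input: &[u8]) -> IResult<&[u8], (u8, u8)> {
        bits::<_, _, Error<(&[u8], usize)>, _, _>(tuple((take(4usize), take(4usize))))(input)
    }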

But I'm not an expert on Serde, I suspect it can do bit-level too. I've never written my own Serde library, I just use crates like serde-json.


I use Serde when it's a common format that already has implementations for Serialize/Deserialize, and nom/peg when it's a custom format.


You mean "Parse, don't validate" right?


Oops, yes


Parser combinators are great – we're using them in production. They create very readable code that is easy to compose/extend; if somebody wants to play with them from TypeScript, see [0].

[0] https://github.com/preludejs/parser



