Chumsky, a Rust parser-combinator library with error recovery

jitl · on July 9, 2022

I haven't written a parser with Chumsky myself, but I've played a bit with a small one, if you want to see example syntax. The error reporting for this project is implemented with `ariadne`, which is also really slick.

Parser: https://github.com/ekzhang/percival/blob/main/crates/perciva...

Error reporting: https://github.com/ekzhang/percival/blob/main/crates/perciva...

Datalog playground: https://percival.ink/

To see an error report, delete some punctuation from one of the Datalog code blocks then press shift-return.

brundolf · on July 9, 2022

I'm probably going to convert my compiler project to Rust just so I can use Chumsky and Ariadne. The error feedback and recovery are so much better than anything I could have written by hand.

cercatrova · on July 9, 2022

The creator of Chumsky is also quite a big proponent of getting generic associated types stabilized in Rust [0], interestingly enough. They have several comments talking about how GATs were very helpful for Chumsky to model their parsing and combinators.

[0] https://github.com/rust-lang/rust/pull/96709

dataangel · on July 9, 2022

It's weird to me how much effort is being put in on that thread to find examples of crates needing it when people wanted templated typedefs that they could put inside classes in C++ for YEARS before C++11. The use cases for this stuff were around 15+ years ago!

fanf2 · on July 9, 2022

My main question is how this compares to nom which has long been a solid choice for parser combinators in Rust. But no mention in the readme?

mullr · on July 9, 2022

Error recovery in nom is left as a very obtuse exercise to the reader. Custom error reporting is difficult at best. That stuff is supposed to be better in chumsky; I don’t know if it actually is.

However, for my own parser which is currently written in nom, my current plan is to port it over to tree-sitter. Its error recovery is completely automatic, and a fair sight better than anything I have time to do by hand.

atoav · on July 9, 2022

nom chumsky?

kjeetgill · on July 9, 2022

Thank you for this revelation. I'd always imagined nom being about "eating tokens" but this makes so much sense for a parser.

brundolf · on July 9, 2022

How do tree-sitter's ergonomics compare to these other two?

mullr · on July 9, 2022

Caveats: I've used nom in anger, chumsky hardly at all, and tree-sitter only for prototyping. I'm using nom to parse a DSL, essentially a small programming language.

The essential difference between nom/chumsky and tree-sitter is that the former are libraries for constructing parsers out of smaller parsers, whereas tree-sitter takes a grammar specification and produces a parser. This may seem small at first, but is a massive difference in practice.

As far as ergonomics go, that's a rather subjective question. On the surface, the parser combinator libraries seem easier to use. They integrate well with the host language, so you can stay in the same environment. But this comes with a caveat: parser combinators are a functional programming pattern, and Rust is only kind of a functional language, if you treat it juuuuust right. This will make itself known when your program isn't quite right; I've seen type errors that take up an entire terminal window or more. It's also very difficult to decompose a parser into functions. In the best case, you need to write your functions to be generic over type constraints that are subtle and hard to write (again, if you get this wrong, the errors are overwhelming). I often give up and just copy the code. I have at times believed that some of these types are impossible to write down in a program (and can only exist in the type inferencer), but I don't know if that's actually true.
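To make the "decompose a parser into functions" point concrete, here is a minimal hand-rolled sketch using only the standard library. It is not nom's or chumsky's API; `satisfy`, `pair`, `many`, `digits`, and `key_value` are all hypothetical names, and the `Option<(value, rest)>` result shape is an assumption made for brevity. The thing to notice is that every reusable piece has to spell its return type as `impl Fn(...)`, because the underlying closure types cannot be named.

```rust
// Hypothetical sketch, not a real library API.
// A "parser" here is any function from input to Option<(value, remaining input)>.

/// Parse one character that satisfies `pred`.
fn satisfy<'a>(pred: impl Fn(char) -> bool) -> impl Fn(&'a str) -> Option<(char, &'a str)> {
    move |input: &'a str| {
        let c = input.chars().next()?;
        if pred(c) {
            Some((c, &input[c.len_utf8()..]))
        } else {
            None
        }
    }
}

/// Run `first`, then `second` on what is left; return both results.
fn pair<'a, A, B>(
    first: impl Fn(&'a str) -> Option<(A, &'a str)>,
    second: impl Fn(&'a str) -> Option<(B, &'a str)>,
) -> impl Fn(&'a str) -> Option<((A, B), &'a str)> {
    move |input: &'a str| {
        let (a, rest) = first(input)?;
        let (b, rest) = second(rest)?;
        Some(((a, b), rest))
    }
}

/// Apply `parser` zero or more times, collecting the results.
fn many<'a, T>(
    parser: impl Fn(&'a str) -> Option<(T, &'a str)>,
) -> impl Fn(&'a str) -> Option<(Vec<T>, &'a str)> {
    move |mut input: &'a str| {
        let mut items = Vec::new();
        while let Some((item, rest)) = parser(input) {
            // Stop if the inner parser succeeded without consuming anything,
            // otherwise this loop would never terminate.
            if rest.len() == input.len() {
                break;
            }
            items.push(item);
            input = rest;
        }
        Some((items, input))
    }
}

/// A reusable piece: zero or more ASCII digits.
fn digits<'a>() -> impl Fn(&'a str) -> Option<(Vec<char>, &'a str)> {
    many(satisfy(|c: char| c.is_ascii_digit()))
}

/// A second piece built from the first: one letter followed by digits.
fn key_value<'a>() -> impl Fn(&'a str) -> Option<((char, Vec<char>), &'a str)> {
    pair(satisfy(|c: char| c.is_ascii_alphabetic()), digits())
}
```

Calling `key_value()` gives back a closure you can apply to a `&str`. Even at this toy scale the signatures are longer than the bodies, and real libraries add error and input type parameters on top of this.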

deep breath

Tree-sitter's user interface is rather different. You write your grammar in a JavaScript internal DSL, which gets run and produces a JSON file, and then a code generator reads that and produces C source code (I think the codegen is now written in Rust). This is a much more roundabout way of getting to a parser, but it's worth it because: (1) tree-sitter was designed for parsing programming languages while nom very clearly was not, and (2) the parsers it generates are REALLY GOOD. Tree-sitter knows operator precedence, where nom cannot do this natively (there's a PR open for the next version: https://github.com/Geal/nom/pull/1362). Tree-sitter's parsing algorithm (GLR) is tolerant to recursion patterns that will send a parser combinator library off into the weeds, unless it uses special transformations to accommodate them.
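For a sense of what handling precedence yourself involves when the library doesn't do it for you, here is a small precedence-climbing sketch in plain Rust. It is neither nom nor tree-sitter code; the function names, the single-digit operands, and the two precedence levels are all assumptions chosen to keep it short.

```rust
// Hypothetical sketch: precedence climbing over single-digit operands.

/// Binding power of each binary operator; higher binds tighter.
fn binding_power(op: char) -> Option<u8> {
    match op {
        '+' | '-' => Some(1),
        '*' | '/' => Some(2),
        _ => None,
    }
}

/// Parse and evaluate an expression whose operators all bind at least as
/// tightly as `min_power`. `pos` is the index of the next unread token.
fn eval(tokens: &[char], pos: &mut usize, min_power: u8) -> Option<i64> {
    // Operand: a single decimal digit.
    let mut lhs = tokens.get(*pos)?.to_digit(10)? as i64;
    *pos += 1;
    while let Some(&op) = tokens.get(*pos) {
        let power = match binding_power(op) {
            Some(p) if p >= min_power => p,
            _ => break,
        };
        *pos += 1;
        // Requiring `power + 1` on the right makes every operator left-associative.
        let rhs = eval(tokens, pos, power + 1)?;
        lhs = match op {
            '+' => lhs + rhs,
            '-' => lhs - rhs,
            '*' => lhs * rhs,
            '/' => lhs.checked_div(rhs)?,
            _ => unreachable!(),
        };
    }
    Some(lhs)
}

/// Evaluate a whole string, rejecting trailing garbage.
fn evaluate(src: &str) -> Option<i64> {
    let tokens: Vec<char> = src.chars().filter(|c| !c.is_whitespace()).collect();
    let mut pos = 0;
    let value = eval(&tokens, &mut pos, 1)?;
    if pos == tokens.len() {
        Some(value)
    } else {
        None
    }
}
```

With a grammar-driven tool you declare the precedence table and the generator does the rest; with combinators you end up writing (or importing) something like the loop above.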

It might sound like I'm shitting on nom here, but that's not the goal. It's a fantastic piece of work, and I've gotten a lot of value from it. But it's not for parsing programming languages. Reach for nom when you want to parse a binary file or protocol.

As for chumsky: the fact that it's a parser combinator library in Rust means that it's going to be subject to a lot of the same issues as nom, fundamentally. That's why I'm targeting tree-sitter next.

There's no reason tree-sitter grammars couldn't be written in an internal DSL, perhaps in parser-combinator style (https://github.com/engelberg/instaparse does this). That could smooth over a lot of the rough edges.

strogonoff · on July 11, 2022

Tree-sitter appears to be ultra-focused on producing valid syntax trees really fast. This is great for e.g. syntax highlighting, but suboptimal in cases where you are writing a reference parser for your custom language and want to provide very useful error descriptions. Chumsky seems to be more suited for the latter (and its tutorial also has a section on precedence[0], so it seems to deal at least with that case).

This overview of parser tradeoffs may be helpful: https://blog.jez.io/tree-sitter-limitations/.

[0] https://github.com/zesterer/chumsky/blob/master/tutorial.md#...

IshKebab · on July 10, 2022

I've used Nom. It isn't really that suited to parsing languages with things like precedence. It also doesn't have any error recovery, and error messages are very basic. It's ok if you are designing your own language because you can design the language to be easy to parse with Nom but I'm not sure I'd recommend it for parsing an existing language.

I've also used Tree Sitter. It has error recovery and a powerful grammar system but the downsides are that the grammar system is quite confusing compared to parser combinators, it's written in C which makes cross compilation a pain, and it doesn't actually do the whole job. You get a stringly typed tree of nodes that you have to do a second parse over. Quite tedious. Acceptable if you don't need a full AST though, e.g. you're just searching for specific nodes.

I haven't tried Chumsky yet but I definitely will. Looks very promising.

charleskinbote · on July 9, 2022

I was going to ask the same thing. I've used nom for a library of mine but wasn't totally satisfied with it, so I think I'll give this a try.

mpalmer · on July 9, 2022

It's mentioned in the performance section, "another crate with similar design".

https://github.com/zesterer/chumsky#performance

monocasa · on July 9, 2022

That compares to pom, not the more generally used nom that the parent is asking about (and I'm curious about as well).

mpalmer · on July 9, 2022

So it does. I leave my comment as a shame-faced warning to people who read too quickly before commenting.

voxl · on July 9, 2022

Actually trying to write a parser with this was something else for me; the kinds of types I was looking at seemed impenetrable. It looks very nice, but usability, at least for me, made me wash my hands and just roll my own parser.

I've used nom successfully in the past, even when it was macro-hell. Part of that might have been the greater amount of available combinators though, making getting really into the weeds less likely.

brundolf · on July 9, 2022

Yeah the types themselves are fairly impenetrable, but if you follow the tutorial it's not too hard to learn to actually use. I just did the tutorial and it made a lot of sense

In fairness, the same can be said of Rust's iterator types; they drive autocomplete and they surface errors when you do something wrong, but they're not really directly readable. This sort of thing is the reason `impl` types exist
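A tiny standard-library illustration of that point (the function name `even_squares` is made up for the example): the concrete return type below is something like `Map<Filter<Range<u32>, {closure}>, {closure}>`, which can't be written out because closure types have no names, so `impl Iterator` stands in for it.

```rust
/// Squares of the even numbers below `limit`.
/// The caller only sees "some iterator of u32", not the nested adapter type.
fn even_squares(limit: u32) -> impl Iterator<Item = u32> {
    (0..limit).filter(|n| n % 2 == 0).map(|n| n * n)
}
```

Something like `even_squares(7).collect::<Vec<_>>()` works without the caller ever naming the adapter types, which is roughly how combinator-built parsers get passed around too.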

aljazmerzen · on July 9, 2022

I can vouch for ariadne, the error display library that is the sister project of Chumsky.

We integrated it into the PRQL compiler and the errors are beautiful!

https://github.com/prql/prql/pull/275

de_keyboard · on July 9, 2022

I'm interested in how this library handles recursion, e.g.

   Expr = '(' Expr ')'
        | Expr '+' Expr

It's very easy to get stuck in infinite loops when handling recursion in parser-combinator libraries.

Does this library improve on that?

zesterer · on July 14, 2022

Chumsky still has trouble with left recursion, like many PEG parsers, but it's fairly easy to rewrite such grammars without left recursion as demonstrated in the tutorial: https://github.com/zesterer/chumsky/blob/master/tutorial.md
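As a rough illustration of that kind of rewrite, here is a hand-rolled recursive-descent sketch in plain Rust rather than Chumsky code. The grammar, the `Expr` type, and the `atom`/`expr` function names are assumptions made up for the example. The left-recursive rule `Expr = Expr '+' Atom | Atom` becomes `Expr = Atom ('+' Atom)*`, and a fold over the repetition rebuilds the left-associative tree.

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Num(u32),
    Add(Box<Expr>, Box<Expr>),
}

// Left-recursive form (a naive recursive descent would call itself forever):
//     Expr = Expr '+' Atom | Atom
// Equivalent form without left recursion:
//     Expr = Atom ('+' Atom)*
//     Atom = digit | '(' Expr ')'

/// Parse a single digit or a parenthesised expression.
fn atom(input: &str) -> Option<(Expr, &str)> {
    let mut chars = input.chars();
    match chars.next()? {
        '(' => {
            let (inner, rest) = expr(chars.as_str())?;
            let rest = rest.strip_prefix(')')?;
            Some((inner, rest))
        }
        c => Some((Expr::Num(c.to_digit(10)?), chars.as_str())),
    }
}

/// Parse `Atom ('+' Atom)*`, folding the repetition to the left.
fn expr(input: &str) -> Option<(Expr, &str)> {
    let (mut lhs, mut rest) = atom(input)?;
    // The loop replaces the left-recursive call; folding into `lhs`
    // keeps the resulting tree left-associative.
    while let Some(after_plus) = rest.strip_prefix('+') {
        let (rhs, after_rhs) = atom(after_plus)?;
        lhs = Expr::Add(Box::new(lhs), Box::new(rhs));
        rest = after_rhs;
    }
    Some((lhs, rest))
}
```

Parsing `1+2+3` this way yields the same shape as `(1+2)+3`, which is what the left-recursive grammar would have described.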

ufo · on July 9, 2022

I wonder what error recovery strategies it implements. The README doesn't go into details.

avgcorrection · on July 9, 2022

There’s also Pomsky, which is an alternative language to regex.