As an exercise in showing how LR works, this was an interesting article. However, I'm unconvinced that this would be a good approach to implement in a real system when the grammar might change in the future (and they pretty much always do). When things change, I think it'd be difficult to keep it correct, and there are lots of tools to help (such as bison and ANTLR). I've implemented several recursive-descent parsers; their simplicity makes it practical to do directly. But recursive-descent, if you're going to use it, seems to cry out for tooling. Does anyone have a different experience?
If you do not mind a small performance hit, then a PEG parser combinator library like FastParse (http://www.lihaoyi.com/post/EasyParsingwithParserCombinators...) is basically tooling to help you write recursive descent parsers. Way easier to set up and learn than bison/antlr if you don't need best-of-the-best performance: just import the library, start sketching out how your parser should behave and you're off to the races: parsing, debugging, error-reporting, etc. are all made really easy.
At least in the Scala community, FastParse is used quite a lot. Works great for parsing source code, obscure data formats (textual or binary), user query languages, etc. As long as you're not parsing gigabytes of big data, you probably won't even notice the 5-10x performance hit.
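To make the combinator point concrete, here is a rough sketch of the idea in Python rather than FastParse itself (which is Scala); the names here are made up, but the shape is the same: a parser is just a function from (text, position) to a result, and combinators compose such functions, so what you end up with is a recursive descent parser you assemble instead of hand-write.

    import re

    def regex(pattern):
        rx = re.compile(pattern)
        def p(s, i):
            m = rx.match(s, i)
            return (m.group(), m.end()) if m else None
        return p

    def seq(*parsers):
        def p(s, i):
            out = []
            for q in parsers:
                r = q(s, i)
                if r is None:
                    return None          # any failing piece fails the sequence
                v, i = r
                out.append(v)
            return out, i
        return p

    def many(parser):
        def p(s, i):
            out = []
            while True:
                r = parser(s, i)
                if r is None:
                    return out, i        # zero or more repetitions
                v, i = r
                out.append(v)
        return p

    # a '+'-separated list of numbers, e.g. "1+2+3"
    number = regex(r"\d+")
    expr = seq(number, many(seq(regex(r"\+"), number)))

    print(expr("1+2+3", 0))   # (['1', [['+', '2'], ['+', '3']]], 5)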
For Rust there's the nom [0] parser combinator library. Very pleasant to use, good support for expressing parse errors, with speed quite close to hand-written parsers.
I'm not aware of any practical uses of recursive ascent either; given that the major production-quality compilers (GCC, Clang, icc, MSVC) all use handwritten recursive descent and can handle grammars as complex as C++, it doesn't seem the claimed extra power of an LR parser is really needed at all in practice.
The biggest issue is not so much parsing the grammar itself (that part is relatively easy) but reasonable error handling and recovery from errors. I think that is why they use something that is hand-written.
Whups... I meant to say that I can see hand-implementing recursive-descent, but recursive-ASCENT seems to cry out for tooling. Especially since there are so many good tools out there.
Yeah, I agree. When writing this I spent a lot of time keeping track of LR states to decide how to split things up into functions, but in a recursive descent parser that's already decided by the grammar.
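For anyone wondering what "splitting things up by LR state" ends up looking like, here is a toy sketch in Python (not the article's code) for the left-recursive grammar E -> E '+' NUM | NUM. Each LR state becomes a function; a shift is a call, and a reduce returns up through as many stack frames as the right-hand side has symbols, after which the state it lands in performs the goto.

    ACCEPT = "$accept"

    def parse(toks):                 # toks: ints and "+" strings, ending with None
        return state0(toks, 0)

    def state0(toks, i):
        # state 0:  E -> . E '+' NUM   |   E -> . NUM
        tok = toks[i]
        if not isinstance(tok, int):
            raise SyntaxError("expected a number")
        lhs, val, i, pops = state2(toks, i + 1, tok)      # shift NUM
        while lhs == "E":
            # state 0 is the bottom of the stack, so every reduce ends here
            # (pops == 0) and state 0 performs the goto on E, i.e. state 1
            lhs, val, i, pops = state1(toks, i, val)
        return val                                        # lhs == ACCEPT

    def state1(toks, i, e):
        # state 1:  E -> E . '+' NUM   |   accept at end of input
        tok = toks[i]
        if tok == "+":
            lhs, val, i, pops = state3(toks, i + 1, e)    # shift '+'
            return lhs, val, i, pops - 1                  # keep unwinding toward state 0
        if tok is None:
            return ACCEPT, e, i, 0
        raise SyntaxError("expected '+' or end of input")

    def state2(toks, i, n):
        # state 2:  E -> NUM .       reduce; RHS has 1 symbol, so 0 more frames to pop
        return "E", n, i, 0

    def state3(toks, i, e):
        # state 3:  E -> E '+' . NUM
        tok = toks[i]
        if not isinstance(tok, int):
            raise SyntaxError("expected a number")
        lhs, val, i, pops = state4(toks, i + 1, e, tok)   # shift NUM
        return lhs, val, i, pops - 1                      # keep unwinding

    def state4(toks, i, e, n):
        # state 4:  E -> E '+' NUM .  reduce; RHS has 3 symbols, so 2 more frames to pop
        return "E", e + n, i, 2

    print(parse([1, "+", 2, "+", 3, None]))   # 6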
I can see myself using it for expression parsing, though, since that's almost inherently left-recursive. I usually use top-down operator precedence (AKA Pratt parsing) for that, but recursive ascent might simplify it in some ways.
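For reference, a Pratt-style expression parser is roughly this shape (a minimal Python sketch, assuming a flat token list of ints and single-character operator strings; not anyone's production code):

    BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expr(tokens, pos=0, min_bp=0):
        left = tokens[pos]                 # primary: assume a number literal
        pos += 1
        # fold in operators whose binding power is at least min_bp
        while pos < len(tokens) and tokens[pos] in BINDING_POWER:
            op = tokens[pos]
            bp = BINDING_POWER[op]
            if bp < min_bp:
                break
            # bp + 1 makes the operator left-associative
            right, pos = parse_expr(tokens, pos + 1, bp + 1)
            left = (op, left, right)
        return left, pos

    tree, _ = parse_expr([1, "+", 2, "*", 3])
    print(tree)   # ('+', 1, ('*', 2, 3))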
A left-associative operator means a Kleene star in the grammar, while a right-associative operator means right-recursion. I think the common view that it's left-recursive is just a weakness of the most common scheme for semantic actions. (I mean, yes, if you need to go through a parse tree first then the tree needs to be left-recursive -- but it seems simpler not to do that.)
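Concretely, the two shapes look like this in a hand-written recursive descent parser (a small Python sketch; the names are made up):

    # left-associative:   sum -> num ('+' num)*     -- a loop, no left recursion
    # right-associative:  pow -> num ('^' pow)?     -- right recursion

    def parse_sum(toks, i=0):
        left, i = parse_num(toks, i)
        while i < len(toks) and toks[i] == "+":        # the Kleene star
            right, i = parse_num(toks, i + 1)
            left = ("+", left, right)                  # fold leftward as we go
        return left, i

    def parse_pow(toks, i=0):
        base, i = parse_num(toks, i)
        if i < len(toks) and toks[i] == "^":           # the right recursion
            exp, i = parse_pow(toks, i + 1)
            return ("^", base, exp), i
        return base, i

    def parse_num(toks, i):
        return toks[i], i + 1

    print(parse_sum([1, "+", 2, "+", 3]))   # (('+', ('+', 1, 2), 3), 5)
    print(parse_pow([2, "^", 3, "^", 2]))   # (('^', 2, ('^', 3, 2)), 5)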
I think Damian Conway (Perl author, keynote speaker, comp-sci professor, seemingly super nice guy too) wrote the recursive descent parser module and its successor module, although someone else maintains it now on CPAN.
Perl OO is underrated. The bless() function and Perl's packages are straightforward and make OO seem less magic.
You even get to pick the encapsulation. Most people go with hashtable based objects, but you can use lists or even strings instead. It's a really good language to see how OO works under the covers.
Recursive descent is very different (if I understand correctly). Some languages that can be parsed with a recursive ascent parser cannot be parsed by a recursive descent parser.
Yes; left recursion. However in practice left recursion can usually be eliminated by adjusting the grammar. Other cases can be handled using various degrees of grammar action cleverness. And finally, recursive descent lends itself naturally to parsing a subset of the language using a different scheme; top down operator precedence for expressions to avoid lots of redundant recursion is a fine example.
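The textbook rewrite, for anyone who hasn't seen it: a left-recursive rule

    A -> A alpha | beta

becomes

    A  -> beta A'
    A' -> alpha A' | <empty>

or, if your notation allows repetition, simply A -> beta alpha*. For expressions, expr -> expr '+' term | term turns into expr -> term ('+' term)*, which is exactly the loop you end up writing in a recursive descent function.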
Making parsing easily incrementalizable is a valuable property, because grammars which incrementalize easily also localize parse errors well, and having both of these properties hold lets you offer convenient APIs for IDEs and text editors much more easily.
For this, you want (as much as possible) for local edits to result in local modifications to the syntax tree, and ensuring that can require thinking carefully both about lexical syntax and abstract syntax. At the lexical level, if the literal syntax forbids newlines in string literals, then the changes to the token stream on a single-character edit are limited to the tokens on the same line. Likewise, in the abstract syntax, if top-level declarations have a keyword, then the effect of a single-token change (eg, adding a mismatched paren) is limited to the declaration it occurs in.
There are a lot of tricks like this (eg, setting things up so you can find and parse the type declarations without having to parse a whole expression), but it's reasonable not to want to learn them. In that case, a really good heuristic is to make your language LL(1) parseable. It's not perfect, but it will rule out the vast majority of things that make incrementalization hard.
Why would you let the grammar dictate the architecture of your compiler?
Left recursion in things like expressions is completely trivial to eliminate. If you want to e.g. embed a DSL in a language, something that's really easy to do with recursive descent, it's worth spending 30 seconds or so to remove left recursion from your expression rules. Etc.
> You first write/imagine the grammar that you want to write, whatever is most intuitive for you.
IMO this is the road to hard-to-parse languages; if you genuinely advocate this path, you'll end up needing a GLR parser before you know it. I think it's much better to ensure your grammar is parseable with LL(1); not only is the tooling simpler, but the language will be less ambiguous for human users too. Follow the example of Pascal, not C++. This is what I'd advocate even for people doing hand-written recursive descent on a new language: validate your grammar with an LL(k) tool even if you don't use the generated parser.
(You've ended up sounding a bit like Ira Baxter, who's often seen around the net selling his DMS toolkit. He has a strong self-interest in people writing complicated grammars, so that he can sell them his services...)
> but the language will be less ambiguous for human users too
I'd bet against this. The human brain works very differently; we always keep something like "trees of contexts" in our heads. And why would you think left recursion is less intuitive to a human user than other kinds of recursion?
> ended up sounding a bit like Ira Baxter
Now I gotta take time and google who this guy is :P
Haha, I was thinking about him too. He does have a point, though, that on 2018 machines we might be able to afford parsing technology a bit more expensive than the 1970s approaches commonly used today.
For the most part you can apply some simple rules to transform a BNF grammar from left recursive to right recursive, or the other way around. The left-to-right direction is what you need for a hand-written recursive descent parser; the reverse is handy when the natural grammar is right recursive but you're using yacc or bison, which handle left recursion more efficiently.
As I mentioned in an uncle comment, recursive descent permits localised hackery to switch parsing scheme or dynamically fix tricky context-specific grammar constructs, which argues against tooling.
Tooling is still very useful for grammar validation though, and it's the way to go for ad hoc parsing of a simple or new language.
I'm pretty sure that Brendan Eich's Narcissus (JavaScript in JavaScript) uses a "recursive ascent" parser, or what I would call a hand-written bottom-up parser:
I played with this code like 8 years ago and remember being pretty confused by this style of parser. I was expecting recursive descent but it was something else.
> It's a matter of whether the children are resolved before the parents, or the parents are resolved before the children.
Yep, and Narcissus resolves the parents before the children by virtue of `Script` calling `Statements`, `Statements` calling `Statement`, etc. before ever actually seeing the bodies of their corresponding production rules.
A bottom-up parser doesn't work that way at all. Its states aren't determined by what they're about to parse, but by what has already been parsed. The `pushTarget` function in Narcissus is completely unrelated to bottom-up parsing.
Yeah, I see what you mean. There is some stuff that doesn't look like a recursive descent parser to me, but that's probably because it uses some JavaScript idioms I don't recognize; it's not my main language.
The code you mentioned, pushTarget in particular, is for semantic checking, nothing to do with purely syntactic analysis AKA parsing. In particular, pushTarget is for break to label and continue to label (the "Target" is a labeled statement's exit or continue point).
OK interesting, thanks. Yes that definitely could be part of my confusion -- that the code is doing more than parsing; it's also doing some semantic analysis in the same pass.
Early on in the WebAssembly project, when it was still a pre-order AST encoding, I wrote an LR-style shift-reduce parser for the bytecode. It was a fun exercise, until it came to debugging. It's fast but relatively hard to maintain, even for a constrained, designed bytecode like WebAssembly. WASM is now a stack machine, so parsing maintains an abstract stack. It's remarkably close to the same algorithm, but a lot easier to think about because of WASM's (now) explicit stack.
Many parsing algorithms were designed with the idea that the input cannot be stored in memory as a whole and that you need to parse the file from start to end. Nowadays, however, memory is usually not a problem, especially not for parsing files that are edited by humans. Who is going to use a gigabyte-sized source file? Storing the whole source in memory allows parsing algorithms that can go backwards. I implemented a recursive descent algorithm with back-tracking which stores all parsed non-terminals at the location where they were first parsed. This results in a simple and relatively fast parser. I used it to implement an interpreting parser, which takes a grammar and an input file and parses the input file according to the grammar, returning an abstract parse tree. Actually, the parser takes an abstract parse tree of the grammar to parse the input. The grammar itself is parsed by the interpreting parser, using a hard-coded abstract parse tree of the grammar of the grammar. See https://github.com/FransFaase/IParse
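A rough sketch of that idea in Python (not the IParse code; the names are made up): backtracking recursive descent where every (rule, position) result is memoized, so nothing is ever parsed twice.

    from functools import lru_cache

    def make_parser(text):
        @lru_cache(maxsize=None)              # memo table: position -> result
        def number(i):
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            return (text[i:j], j) if j > i else None

        @lru_cache(maxsize=None)
        def expr(i):
            # expr -> number '+' expr | number
            # try the long alternative; back-track to the short one on failure
            r = number(i)
            if r is None:
                return None
            left, j = r
            if j < len(text) and text[j] == "+":
                rest = expr(j + 1)
                if rest is not None:
                    right, k = rest
                    return ("+", left, right), k
            return left, j

        return expr

    parse = make_parser("1+2+3")
    print(parse(0))   # (('+', '1', ('+', '2', '3')), 5)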
> Who is going to use a gigabyte-sized source file?
I used to mess around with the ROSE C++ source-to-source transformation framework (http://rosecompiler.org/). One of the files in there (defining the AST structure for representing C++ programs) was about 100,000 lines of autogenerated code. It wasn't heavily templated, I think, but it did pull in a bunch of standard C++ headers. I could imagine the whole preprocessed thing approaching a gigabyte.
Yes, parsing generated source code might be a problem, and yes, C++ preprocessing adds a lot as well. But even 100,000 lines of 80 characters would still be (a little less than) 8 MByte. I did not design my interpreting parser with maximum performance as a goal, but just to have a parser that is easy to use and allows for much flexibility with respect to the grammar specification. I feel that it is a good parser for domain specific languages.
Top-down (descent) parsing gets a nice speed boost when you realize creating an AST isn't always necessary: the parse visits the input in the same order a traversal of the AST would, so you can often act on the structure directly instead of building the tree first.
Bottom-up parsing, on the other hand, typically requires creating the AST first, only to traverse it top-down in a manner resembling recursive descent. The speed advantages of bottom-up parsing are often lost in the inefficiencies of the AST.
Of course this doesn't always apply. It depends on the language.
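As a tiny illustration of the no-AST point (a Python sketch, assuming a token list of number strings and "+"): the parser can act on values in the same order an AST walk would visit them, so no tree ever gets built.

    def eval_sum(toks, i=0):
        # sum -> num ('+' num)*
        total, i = eval_num(toks, i)
        while i < len(toks) and toks[i] == "+":
            n, i = eval_num(toks, i + 1)
            total += n                     # act on the value immediately; no tree
        return total, i

    def eval_num(toks, i):
        return int(toks[i]), i + 1

    print(eval_sum(["1", "+", "2", "+", "3"]))   # (6, 5)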