As an exercise in showing how LR works, this was an interesting article. However, I'm unconvinced that this would be a good approach to implement in a real system when the grammar might change in the future (and they pretty much always do). When things change, I think it'd be difficult to keep it correct, and there are lots of tools to help (such as bison and ANTLR). I've implemented several recursive-descent parsers; their simplicity makes it practical to do directly. But recursive-descent, if you're going to use it, seems to cry out for tooling. Does anyone have a different experience?
If you do not mind a small performance hit, then a PEG parser combinator library like FastParse (http://www.lihaoyi.com/post/EasyParsingwithParserCombinators...) is basically tooling to help you write recursive descent parsers. Way easier to set up and learn than bison/antlr if you don't need best-of-the-best performance: just import the library, start sketching out how your parser should behave and you're off to the races: parsing, debugging, error-reporting, etc. are all made really easy.
At least in the Scala community, FastParse is used quite a lot. Works great for parsing source code, obscure data formats (textual or binary), user query languages, etc. As long as you're not parsing gigabytes of big data, you probably won't even notice the 5-10x performance hit.
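To make the combinator point concrete, here is a rough sketch of the idea in Python rather than FastParse itself (which is Scala); the names here are made up, but the shape is the same: a parser is just a function from (text, position) to a result, and combinators compose such functions, so what you end up with is a recursive descent parser you assemble instead of hand-write.

    import re

    def regex(pattern):
        rx = re.compile(pattern)
        def p(s, i):
            m = rx.match(s, i)
            return (m.group(), m.end()) if m else None
        return p

    def seq(*parsers):
        def p(s, i):
            out = []
            for q in parsers:
                r = q(s, i)
                if r is None:
                    return None          # any failing piece fails the sequence
                v, i = r
                out.append(v)
            return out, i
        return p

    def many(parser):
        def p(s, i):
            out = []
            while True:
                r = parser(s, i)
                if r is None:
                    return out, i        # zero or more repetitions
                v, i = r
                out.append(v)
        return p

    # a '+'-separated list of numbers, e.g. "1+2+3"
    number = regex(r"\d+")
    expr = seq(number, many(seq(regex(r"\+"), number)))

    print(expr("1+2+3", 0))   # (['1', [['+', '2'], ['+', '3']]], 5)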
For Rust there's the nom [0] parser combinator library. Very pleasant to use, good support for expressing parse errors, with speed quite close to hand-written parsers.
I'm not aware of any practical uses of recursive ascent either; given that the major production-quality compilers (GCC, Clang, icc, MSVC) all use handwritten recursive descent and can handle grammars as complex as C++, it doesn't seem the claimed extra power of an LR parser is really needed at all in practice.
The biggest issue is not so much parsing the grammar itself (that part is relatively easy) but reasonable error handling and recovery from errors. I think that is why they use something that is hand-written.
Whups... I meant to say that I can see hand-implementing recursive-descent, but recursive-ASCENT seems to cry out for tooling. Especially since there are so many good tools out there.
Yeah, I agree. When writing this I spent a lot of time keeping track of LR states to decide how to split things up into functions, but in a recursive descent parser that's already decided by the grammar.
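For anyone wondering what "splitting things up by LR state" ends up looking like, here is a toy sketch in Python (not the article's code) for the left-recursive grammar E -> E '+' NUM | NUM. Each LR state becomes a function; a shift is a call, and a reduce returns up through as many stack frames as the right-hand side has symbols, after which the state it lands in performs the goto.

    ACCEPT = "$accept"

    def parse(toks):                 # toks: ints and "+" strings, ending with None
        return state0(toks, 0)

    def state0(toks, i):
        # state 0:  E -> . E '+' NUM   |   E -> . NUM
        tok = toks[i]
        if not isinstance(tok, int):
            raise SyntaxError("expected a number")
        lhs, val, i, pops = state2(toks, i + 1, tok)      # shift NUM
        while lhs == "E":
            # state 0 is the bottom of the stack, so every reduce ends here
            # (pops == 0) and state 0 performs the goto on E, i.e. state 1
            lhs, val, i, pops = state1(toks, i, val)
        return val                                        # lhs == ACCEPT

    def state1(toks, i, e):
        # state 1:  E -> E . '+' NUM   |   accept at end of input
        tok = toks[i]
        if tok == "+":
            lhs, val, i, pops = state3(toks, i + 1, e)    # shift '+'
            return lhs, val, i, pops - 1                  # keep unwinding toward state 0
        if tok is None:
            return ACCEPT, e, i, 0
        raise SyntaxError("expected '+' or end of input")

    def state2(toks, i, n):
        # state 2:  E -> NUM .       reduce; RHS has 1 symbol, so 0 more frames to pop
        return "E", n, i, 0

    def state3(toks, i, e):
        # state 3:  E -> E '+' . NUM
        tok = toks[i]
        if not isinstance(tok, int):
            raise SyntaxError("expected a number")
        lhs, val, i, pops = state4(toks, i + 1, e, tok)   # shift NUM
        return lhs, val, i, pops - 1                      # keep unwinding

    def state4(toks, i, e, n):
        # state 4:  E -> E '+' NUM .  reduce; RHS has 3 symbols, so 2 more frames to pop
        return "E", e + n, i, 2

    print(parse([1, "+", 2, "+", 3, None]))   # 6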
I can see myself using it for expression parsing, though, since that's almost inherently left-recursive. I usually use top-down operator precedence (AKA Pratt parsing) for that, but recursive ascent might simplify it in some ways.
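For reference, a Pratt-style expression parser is roughly this shape (a minimal Python sketch, assuming a flat token list of ints and single-character operator strings; not anyone's production code):

    BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expr(tokens, pos=0, min_bp=0):
        left = tokens[pos]                 # primary: assume a number literal
        pos += 1
        # fold in operators whose binding power is at least min_bp
        while pos < len(tokens) and tokens[pos] in BINDING_POWER:
            op = tokens[pos]
            bp = BINDING_POWER[op]
            if bp < min_bp:
                break
            # bp + 1 makes the operator left-associative
            right, pos = parse_expr(tokens, pos + 1, bp + 1)
            left = (op, left, right)
        return left, pos

    tree, _ = parse_expr([1, "+", 2, "*", 3])
    print(tree)   # ('+', 1, ('*', 2, 3))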
A left-associative operator means a Kleene star in the grammar, while a right-associative operator means right-recursion. I think the common view that it's left-recursive is just a weakness of the most common scheme for semantic actions. (I mean, yes, if you need to go through a parse tree first then the tree needs to be left-recursive -- but it seems simpler not to do that.)
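Concretely, the two shapes look like this in a hand-written recursive descent parser (a small Python sketch; the names are made up):

    # left-associative:   sum -> num ('+' num)*     -- a loop, no left recursion
    # right-associative:  pow -> num ('^' pow)?     -- right recursion

    def parse_sum(toks, i=0):
        left, i = parse_num(toks, i)
        while i < len(toks) and toks[i] == "+":        # the Kleene star
            right, i = parse_num(toks, i + 1)
            left = ("+", left, right)                  # fold leftward as we go
        return left, i

    def parse_pow(toks, i=0):
        base, i = parse_num(toks, i)
        if i < len(toks) and toks[i] == "^":           # the right recursion
            exp, i = parse_pow(toks, i + 1)
            return ("^", base, exp), i
        return base, i

    def parse_num(toks, i):
        return toks[i], i + 1

    print(parse_sum([1, "+", 2, "+", 3]))   # (('+', ('+', 1, 2), 3), 5)
    print(parse_pow([2, "^", 3, "^", 2]))   # (('^', 2, ('^', 3, 2)), 5)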
I think Damian Conway (Perl author, keynote speaker, comp-sci professor, seemingly super nice guy too) wrote the recursive descent parser module and its successor module, although someone else maintains it now on CPAN.
Perl OO is underrated. The bless() function and Perl's packages are straightforward and make OO seem less magic.
You even get to pick the encapsulation. Most people go with hashtable based objects, but you can use lists or even strings instead. It's a really good language to see how OO works under the covers.
Recursive descent is very different (if I understand correctly). Some languages that can be parsed with a recursive ascent parser cannot be parsed by a recursive descent parser.
Yes; left recursion. However in practice left recursion can usually be eliminated by adjusting the grammar. Other cases can be handled using various degrees of grammar action cleverness. And finally, recursive descent lends itself naturally to parsing a subset of the language using a different scheme; top down operator precedence for expressions to avoid lots of redundant recursion is a fine example.
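The textbook rewrite, for anyone who hasn't seen it: a left-recursive rule

    A -> A alpha | beta

becomes

    A  -> beta A'
    A' -> alpha A' | <empty>

or, if your notation allows repetition, simply A -> beta alpha*. For expressions, expr -> expr '+' term | term turns into expr -> term ('+' term)*, which is exactly the loop you end up writing in a recursive descent function.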
Making parsing easily incrementalizable is a valuable property, because grammars which incrementalize easily also localize parse errors well, and having both of these properties hold lets you offer convenient APIs for IDEs and text editors much more easily.
For this, you want (as much as possible) for local edits to result in local modifications to the syntax tree, and ensuring that can require thinking carefully both about lexical syntax and abstract syntax. At the lexical level, if the literal syntax forbids newlines in string literals, then the changes to the token stream on a single-character edit are limited to the tokens on the same line. Likewise, in the abstract syntax, if top-level declarations have a keyword, then the effect of a single-token change (eg, adding a mismatched paren) is limited to the declaration it occurs in.
There are a lot of tricks like this (eg, setting things up so you can find and parse the type declarations without having to parse a whole expression), but it's reasonable not to want to learn them. In that case, a really good heuristic is to make your language LL(1) parseable. It's not perfect, but it will rule out the vast majority of things that make incrementalization hard.
Why would you let the grammar dictate the architecture of your compiler?
Left recursion in things like expressions is completely trivial to eliminate. If you want to e.g. embed a DSL in a language, something that's really easy to do with recursive descent, it's worth spending 30 seconds or so to remove left recursion from your expression rules. Etc.
> You first write/imagine the grammar that you want to write, whatever is most intuitive for you.
IMO this is the road to hard-to-parse languages; if you genuinely advocate this path, you'll end up needing a GLR parser before you know it. I think it's much better to ensure your grammar is parseable with LL(1); not only is the tooling simpler, but the language will be less ambiguous for human users too. Follow the example of Pascal, not C++. This is what I'd advocate even for people doing hand-written recursive descent on a new language: validate your grammar with an LL(k) tool even if you don't use the generated parser.
(You've ended up sounding a bit like Ira Baxter, who's often seen around the net selling his DMS toolkit. He has a strong self-interest in people writing complicated grammars, so that he can sell them his services...)
> but the language will be less ambiguous for human users too
I'd bet against this. The human brain works very differently; we always keep something like "trees of contexts" in our heads. And why would you think left recursion is less intuitive to a human user than other kinds of recursion?
> ended up sounding a bit like Ira Baxter
Now I gotta take time and google who this guy is :P
Haha, I was thinking about him too. He does have a point, though, that on 2018 machines we might be able to afford parsing technology a bit more expensive than the 1970s approaches commonly used today.
For the most part you can apply some simple rules to transform a BNF grammar from left recursive to right recursive, or the other way around. The left-to-right direction is what you need for a hand-written recursive descent parser; the reverse is handy when the natural grammar is right recursive but you're using yacc or bison, which handle left recursion more efficiently.
As I mentioned in an uncle comment, recursive descent permits localised hackery to switch parsing scheme or dynamically fix tricky context-specific grammar constructs, which argues against tooling.
Tooling is still very useful for grammar validation though, and it's the way to go for ad hoc parsing of a simple or new language.
I'm pretty sure that Brendan Eich's Narcissus (JavaScript in JavaScript) uses a "recursive ascent" parser, or what I would call a hand-written bottom-up parser:
I played with this code like 8 years ago and remember being pretty confused by this style of parser. I was expecting recursive descent but it was something else.
> It's a matter of whether the children are resolved before the parents, or the parents are resolved before the children.
Yep, and Narcissus resolves the parents before the children by virtue of `Script` calling `Statements`, `Statements` calling `Statement`, etc. before ever actually seeing the bodies of their corresponding production rules.
A bottom-up parser doesn't work that way at all. Its states aren't determined by what they're about to parse, but by what has already been parsed. The `pushTarget` function in Narcissus is completely unrelated to bottom-up parsing.
Yeah, I see what you mean. There is some stuff that doesn't look like a recursive descent parser to me, but that's probably because it uses some JavaScript idioms I don't recognize; it's not my main language.
The code you mentioned, pushTarget in particular, is for semantic checking, nothing to do with purely syntactic analysis AKA parsing. In particular, pushTarget is for break to label and continue to label (the "Target" is a labeled statement's exit or continue point).
OK interesting, thanks. Yes that definitely could be part of my confusion -- that the code is doing more than parsing; it's also doing some semantic analysis in the same pass.
Early on in the WebAssembly project, when it was still a pre-order AST encoding, I wrote an LR-style shift-reduce parser for the bytecode. It was a fun exercise, until it came to debugging. It's fast but relatively hard to maintain, even for a constrained, designed bytecode like WebAssembly. WASM is now a stack machine, so parsing maintains an abstract stack. It's remarkably close to the same algorithm, but a lot easier to think about because of WASM's (now) explicit stack.
Many parsing algorithms were designed with the idea that the input cannot be stored in memory as a whole and that you need to parse the file from start to end. Nowadays, however, memory is usually not a problem, especially not for parsing files that are edited by humans. Who is going to use a gigabyte-sized source file? Storing the whole source in memory allows parsing algorithms that can go backwards. I implemented a recursive descent algorithm with back-tracking which stores all parsed non-terminals at the location where they were first parsed. This results in a simple and relatively fast parser. I used it to implement an interpreting parser, which takes a grammar and an input file and parses the input file according to the grammar, returning an abstract parse tree. Actually, the parser takes an abstract parse tree of the grammar to parse the input. The grammar itself is parsed by the interpreting parser, using a hard-coded abstract parse tree of the grammar of the grammar. See https://github.com/FransFaase/IParse
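A rough sketch of that idea in Python (not the IParse code; the names are made up): backtracking recursive descent where every (rule, position) result is memoized, so nothing is ever parsed twice.

    from functools import lru_cache

    def make_parser(text):
        @lru_cache(maxsize=None)              # memo table: position -> result
        def number(i):
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            return (text[i:j], j) if j > i else None

        @lru_cache(maxsize=None)
        def expr(i):
            # expr -> number '+' expr | number
            # try the long alternative; back-track to the short one on failure
            r = number(i)
            if r is None:
                return None
            left, j = r
            if j < len(text) and text[j] == "+":
                rest = expr(j + 1)
                if rest is not None:
                    right, k = rest
                    return ("+", left, right), k
            return left, j

        return expr

    parse = make_parser("1+2+3")
    print(parse(0))   # (('+', '1', ('+', '2', '3')), 5)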
> Who is going to use a gigabyte-sized source file?
I used to mess around with the ROSE C++ source-to-source transformation framework (http://rosecompiler.org/). One of the files in there (defining the AST structure for representing C++ programs) was about 100,000 lines of autogenerated code. It wasn't heavily templated, I think, but it did pull in a bunch of standard C++ headers. I could imagine the whole preprocessed thing approaching a gigabyte.
Yes, parsing generated source code might be a problem, and yes, C++ preprocessing adds a lot as well. But even 100,000 lines of 80 characters would still be (a little less than) 8 MByte. I did not design my interpreting parser with maximum performance as a goal, but just to have a parser that is easy to use and allows for much flexibility with respect to the grammar specification. I feel that it is a good parser for domain specific languages.
Top-down (descent) parsing gets a nice speed boost when you realize creating an AST isn't always necessary: the parse visits the input in the same order a traversal of the AST would, so you can often act on the structure directly instead of building the tree first.
Bottom-up parsing, on the other hand, typically requires creating the AST first, only to traverse it top-down in a manner resembling recursive descent. The speed advantages of bottom-up parsing are often lost in the inefficiencies of the AST.
Of course this doesn't always apply. It depends on the language.
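As a tiny illustration of the no-AST point (a Python sketch, assuming a token list of number strings and "+"): the parser can act on values in the same order an AST walk would visit them, so no tree ever gets built.

    def eval_sum(toks, i=0):
        # sum -> num ('+' num)*
        total, i = eval_num(toks, i)
        while i < len(toks) and toks[i] == "+":
            n, i = eval_num(toks, i + 1)
            total += n                     # act on the value immediately; no tree
        return total, i

    def eval_num(toks, i):
        return int(toks[i]), i + 1

    print(eval_sum(["1", "+", "2", "+", "3"]))   # (6, 5)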