I respect this article a lot (and I appreciated the Rust compiler generator pointer in it!); my own
personal take is one of partial agreement, but I would like to raise two points here:
1. Part of compiler construction courses is to instill in the students the notion
of meta-linguistic abstraction (SICP), namely to solve problems by creating
and implementing domain-specific languages (DSLs) that permit declarative statements
about the domain in a natural and therefore maintainable notation, which is separate
from an engine that can interpret or compile it.
Compiler classes are ideal for teaching DSLs because the topic is parsing and
compiling/interpreting anyhow, but also because students can see that compiler people
practice what they preach: they use these techniques for compiler construction themselves,
often with more than one DSL (one for lexers, one for grammars, sometimes others).
Parser generators are also used outside of programming languages; for example, in
computational linguistics, formal grammars are often used to model
human language morphology, phonology or syntax (with ambiguous grammars). Tomita's GLR parsing
- mentioned elsewhere in this thread - comes from that application area.
2. Another part of compiler construction tertiary education is to get across the
utility of theory for practical applications and the closeness of theory and practice
in computer science: Formal language theory is beautiful and well-developed, and showing
students that it is also useful, because it gives runtime and space complexity guarantees
and systematic approaches that reduce faulty implementations, is a wonderful thing.
One could also interpret the "failure" of compiler writers to do this as evidence that the "DSL paradigm" for programming itself is, at best, extremely limited.
One obvious problem is that the semantics of a DSL aren't extensible in that DSL, making it highly brittle. So, e.g., with parsing, as soon as you need bespoke syntax-error handling and the like, you're leaving the world of any DSL you can create, since "bespoke" here will ultimately mean a semantic specialisation that no generic DSL will support.
At the point where your DSL enables these arbitrary specialisations, it is just the host language.
In Clojure, the adage says "When building a DSL, favor data over functions over macros". When I write a DSL I start by writing a sample of it as pure data (think of Clojure as a data-processing-oriented language with embedded JSON; to be exact, the format is called EDN). Then I write the parsing function. What I do, though, is make sure the parser is extensible. To do so I cast any data element that this function parses into a corresponding function, unless that element is already a function. This way I can directly write custom functions within the DSL. This is very handy when handling conditions and their composition through function combinators: I can just reuse the language's boolean logic (if, if-not, switch-cases etc.) and don't have to reimplement it within the DSL. Once the DSL data has been parsed into functions, they are composed through classical function composition available in the language.

Sometimes I also wish I had access to the DSL through macros, mostly to optimize those DSLs. The situation where I found I needed this required me to affect the compiler's state in some way, so I didn't carry my investigations forward on this subject, and I'm very interested in the development of Lux/luxlang since it comes with macros that are passed the compiler's state monadically.
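The "cast data elements to functions unless they already are one" idea translates outside Clojure too. Here is a minimal sketch of it transposed to C++ (the comment above describes Clojure/EDN; the names and the toy condition vocabulary here are hypothetical): plain data compiles to predicates, a host-language function embedded in the data is used as-is, and nesting composes through a combinator.

    // Sketch: compile a DSL given as plain data into composed predicates.
    // A host-language function embedded in the data is the extension point.
    #include <functional>
    #include <iostream>
    #include <string>
    #include <variant>
    #include <vector>

    using Pred = std::function<bool(int)>;  // a compiled DSL condition
    struct Node;                            // a DSL element as plain data
    using Data = std::variant<std::string, Pred, std::vector<Node>>;
    struct Node { Data value; };

    Pred compile(const Node &n);

    // Combinator: AND together the compiled children (function composition).
    Pred conjoin(const std::vector<Node> &children) {
        std::vector<Pred> parts;
        for (const Node &c : children) parts.push_back(compile(c));
        return [parts](int x) {
            for (const Pred &p : parts)
                if (!p(x)) return false;
            return true;
        };
    }

    Pred compile(const Node &n) {
        if (auto *f = std::get_if<Pred>(&n.value))
            return *f;                                  // already a function
        if (auto *s = std::get_if<std::string>(&n.value)) {
            if (*s == "positive") return [](int x) { return x > 0; };
            if (*s == "even")     return [](int x) { return x % 2 == 0; };
            return [](int) { return false; };           // unknown word: reject
        }
        return conjoin(std::get<std::vector<Node>>(n.value));  // nested = AND
    }

    int main() {
        // The DSL "program" is pure data, with one escape hatch into C++.
        Node prog{std::vector<Node>{
            Node{std::string("positive")},
            Node{Pred{[](int x) { return x < 100; }}}}};
        std::cout << compile(prog)(42) << "\n";  // prints 1
    }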
The DSL paradigm is extremely limited, because there are almost no people who are good at writing programming languages that work for developers instead of the other way around.
XML didn’t fail because of some technical limitation. It failed because it tried to turn us all into language designers. At worst JSON asks you for a schema, and that was an afterthought in both cases.
The problem with XML was slow parsing and the horrible syntax of XSD schemas. Turning attributes into elements is also a pain in the ass. And for a lot of applications turning whitespace into elements is unnecessary.
Those are confounding factors. If you made any other product with the same goals, but faster parsing and better schema syntax, probably by integrating the two, then you'd still end up with the same mess because defining a language is worrying about a million corner cases and not being able to fix more than half of them once they ship.
An xml/json schema is for validating a serialization format; a DSL-in-ruby is a programming language (to be fair, it's likely a mess, but I've definitely seen it be less mess).
When you look at business rules / requirements, it's common for 99% of them to be trivial. Of the remainder, 99% are only moderately complex.
If you have, say, fewer than 10k distinct rules, you probably don't have more than one that's really wicked.
However, if you have 100k rules, it's more likely than not that you have quite a few that are going to completely wreck your xml/json model.
> unfortunately many big brain school teach only parser generator tool. here grug usual love of tool is not: parser generator tool generate code of awful snakes nest: impossible understand, bottom up, what? hide recursive nature of grammar from grug and debug impossible, very bad according grug!
In terms of education I disagree about the importance of "LR" (really meaning bottom-up, table driven, automatically generated) parsers vs recursive descent (top-down, typically hand written), for a couple of reasons:
1) Real world compilers are almost always hand written recursive descent. They just turn out to be more flexible, easier to maintain, generate better error messages, etc.
2) Teaching compiler writing as part of a computer science degree is somewhat of a tradition, but the number of students who will ever write a compiler (even for a small DSL, let alone a real language) is minimal. I think it's a useful exercise to show how these things are structured and to teach general "divide-and-conquer" decomposition of complex programs, but that's about it. Given the number of moving parts in a compiler, the emphasis should be on an overview of all the parts and how they fit together, not going overboard on parsing and teaching two different parsing techniques. To the extent that any Comp-Sci student will ever end up needing to write a parser at some point in their career (say for a DSL, or to parse some data format), recursive descent is the more useful one to know.
Parser generators seem largely a historical approach from a time (the 1970s) when computers were so slow that the speed of a table-driven parser was a major advantage, and the idea is firmly implanted in the heads of those of us who grew up reading Aho & Ullman's compiler-writing "Dragon Book". I think they still serve a purpose, but perhaps for more fringe use cases rather than writing actual compilers for full-blown languages!
Having said all that, there's a lot of fun in writing a parser generator, regardless of how useful the end result may be! I wrote a full LR(1) - not YACC-like LALR(1) - parser generator in Modula-2 back in the early 80's, at a time when the text books told you this was impractical/impossible due to the size of a canonical Knuth LR(1) parser.
I like the idea of specifying a parser through a formal grammar, but I think that current technology has a few flaws ranging from pretty annoying to severe.
* Parser generator grammars are usually quite far from 'paper' versions of a formal grammar. The moment you have to deal with associativity/precedence or left-recursive rules, the beautiful grammar tends to fall apart.
In theory most generators have mechanisms to help make this ergonomic, but in practice most grammars I've seen are totally illegible.
* Error messages suck. I think this is actually a bigger issue, but parser generators have pretty poor error messages compared to even naive recursive descent parsers written with a library like megaparsec. There has been research on generating better error messages in generated parsers like menhir [1], but I haven't seen anyone apply this successfully and it requires you to manually annotate each error state with a message, and a real parser can easily have thousands, and they can change each time you update your grammar.
* Parser files are usually fairly illegible because they mix host language code with parser syntax and meta-variables. Some more modern libraries like Pest separate the two which I think is a net improvement.
Nevertheless, I hope that someone figures out a way to square the circle and frees us from handwriting parsers in all but the most exceptional cases.
I wrote a parser generator quite a long time ago that I think improves the syntax quite a lot, and which has an interesting approach to generalisation: you can write conditions on the lookahead (which are just grammars that need to be matched in order to pick a given rule when a conflict needs to be resolved). This construct makes it much easier to write a grammar that matches how a language is designed.
Here's an ANSI-C parser, for example: https://github.com/Logicalshift/TameParse/blob/master/Exampl... - this is an interesting example because `(foo)(bar)` is fully ambiguous in ANSI C: it can be a cast or a function call depending on if `foo` is a type or a variable.
An advantage over GLR or backtracking approaches is that this still detects ambiguities in the language, so it's much easier to write a grammar that doesn't end up running in exponential time or space; plus, when an ambiguity is resolved by the generalisation, which version wins is specified by the grammar and is not arbitrary (backtracking) or left until later (GLR).
I was working on improving error handling when I stopped work on this, but my approach here was not working out.
(This is a long-abandoned project of mine but the approach to ambiguities and the syntax seem like they're novel to me and were definitely an improvement over anything else I found at the time. The lexer language has a few neat features in it too)
> Parser generator grammars are usually quite far from 'paper' versions
The more useful parser input formats allow precedence and associativity annotations as well as optionals, quantifiers and quantifiers with separators.
This really only works great together with your third point,
> Parser files are usually fairly illegible because they mix host language code with parser syntax
You don't need host code if the parser just outputs an AST. Neither quantifiers with or without separators nor optionals make the parser spec more complex if no host code is needed -- the complexity is in the parser generator and AST generator.
> Error messages suck
I found error messages okay-ish when you view them in an IDE, where it is more important where an error occurred than what exactly is wrong (this only applies to syntax errors). I also found it helpful to show the allowed inputs at that place -- those that would NOT have resulted in a syntax error -- and the parser generator can emit a list of both terminals and nonterminals automatically, because this information is what it uses to detect a syntax error in the first place.
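To illustrate that last point with a hedged sketch (the table contents below are made up): the list of inputs that would not have been a syntax error can be read straight out of the table row for the state in which the error occurred, since a symbol has a defined action there exactly when it is allowed.

    // Sketch: derive an "expected ..." message from an LR table row.
    // Every symbol with a defined action in the current state is, by
    // construction, an input that would NOT have raised the syntax error.
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        // Hypothetical action/goto row for one state: symbol -> action id.
        std::map<std::string, int> row = {
            {"identifier", 3}, {"number", 4}, {"(", 7}, {"Expression", 12}};
        std::string msg = "syntax error; expected one of:";
        for (const auto &entry : row)
            msg += " " + entry.first;   // both terminals and nonterminals
        std::puts(msg.c_str());
    }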
I also tripped over the often-mentioned problem that LALR parsers produce less useful error messages than LR(1) parsers. These days, just using LR(1) is underrated IMHO. In my experiments, the table didn't get too large (a 400kB parse table for a simple hardware-description language), and I would have tried to apply compression algorithms to the table instead of resorting to LALR, but that size was small enough to be acceptable to me. Only the run time of the table generation process was a bit annoying (10-20 seconds).
Either use classic LR(1), or Bison's IELR(1), which supports all LR(1) grammars (i.e. will produce identical parses to classic LR(1)) but generates parse tables that are nearly the same size as LALR(1)'s.
I'm not sure about the quality of IELR's error messages, but at least you don't get the new LALR-introduced conflicts that are hard to wrap your head around.
Your experience with the usefulness of error messages from top-down (recursive descent) vs bottom-up (table-driven) parsers is opposite to common experience. The trouble with bottom-up parsers is that they inherently (by way of being bottom-up) have no top-down context, so all they can by default generate is a bottom-up "symbol XXX expected" type error. A top down parser knows what high level construct it's trying to parse, so is in a position to generate a more meaningful error.
Of course in either case the grammar could be augmented with thoughtful error constructs designed to generate meaningful errors when parsed, but that doesn't seem to be common practice given how awful compiler error messages tend to be. Try writing "std::cout" instead of "std::endl" and see what gibberish error message g++ generates. clang is generally better than g++ in this regard.
I can't speak about common experience, only my own, but that tells me that an LR(1) parser can tell which high-level construct it is trying to parse -- this is encoded in the states deeper down in the parse stack, and therefore depends on information which the parser/generator already has.
While to some extent, this may be ambiguous because it depends on tokens further to the right, this is inherent in the "L" part of LR and applies to LL as well; only raising the lookahead length can fix that.
Now I did cheat a bit in what I said above: parsing further terminal symbols "removes" the high-level information from the state, and then it cannot be used for error messages anymore. This has two notable aspects: First, one can simply keep such information as metadata, either encoded in the state (but this blows up the table size because states cannot be merged as easily anymore) or as additional information collected while parsing (probably better -- good idea, I'll have to look at this).
Second, there is the question whether this information is really useful for producing better error messages. An input fragment like "(5 + )" is an incomplete expression, and without further metadata you know that (1) another sub-expression has to appear before the closing parenthesis, (2) you are parsing an expression, and (3) you are parsing a high-level construct that expects an expression (this knowledge is inherent in LR(1)), but you don't* know anymore which high-level construct you are parsing -- though for the error message it is probably irrelevant.
Obviously, that last part depends, case by case.
(*) re-reading that, I'll have to admit that this incorporates top-down aspects into the parser :)
If you're using "LR" to mean bottom-up generated parsers (as OP does, rather than meaning LR(1) grammars, parsed by whatever means), then there is no higher level construct info on the stack - these operate in strictly bottom-up fashion.
e.g. first an identifier might be matched (which could occur in dozens of different high level contexts - but the parser has no idea at this point), then depending on what comes next it might turn out to be part of an expression (vs the RHS of an assignment, or parameter, or type definition, etc, etc), so after a few "reduce" actions the parser will have realized it's an expression, which again could occur in many contexts (LHS of an assignment, subject of an if-statement, for-statement, etc, etc), and so it goes ...
At any point in time a bottom-up parser knows nothing about the higher level context that it hasn't yet got to - only the tiny program fragment it is progressively recognizing in bottom-up fashion. This is one of the fundamental differences between a bottom-up and top-down (recursive descent) parser.
I was referring specifically to an LR(1) bottom-up parser, but I was confused about what the stack looks like (all this was some time ago).
The parser might indeed encounter an identifier without knowing whether it is part of an expression. However, it does know whether an expression is valid at that point. This isn't buried deep in the stack; it is encoded in the top-of-stack state (this is the part that I was confused about).
For example, suppose that the input is now at a point where a statement starts in a Java-like language. An identifier may be part of many different kinds of statements, and the parser will not know which of them an identifier starts, nor whether that identifier is part of an expression. It does, however, know that a statement is valid at this point, because this information can be found in the transition table used after reduction. It will also know which high-level constructs are valid from that table, including statement and expression, and from the token transition table it will know which tokens are valid.
What you might be referring to is that while the transition table will show that an expression is allowed, it does not easily show whether such an expression must be part of a statement or can indicate another high-level construct. My experience with using such a parser in the context of an IDE is that it is okay to show them both, because most syntax errors made by a user are far more trivial (such as the wrong number of closing parentheses).
Note that an LL(1) top-down (recursive descent) parser would not know either if the first identifier is part of an expression because that information cannot be derived from the left-side context. Only a larger lookahead will fix that. Or backtracking, which is kind of similar to a larger lookahead.
> The parser might indeed encounter an identifier without knowing whether it is part of an expression. However, it does know whether an expression is valid at that point. This isn't buried deep in the stack; it is encoded in the top-of-stack state (this is the part that I was confused about).
No, it really isn't!
Table driven parsers basically have two actions encoded in their tables - for any (state, current input symbol) pair (i.e. table indices) they can either SHIFT or REDUCE. A SHIFT action means to push the symbol onto the stack since it doesn't yet know what it's part of. A REDUCE action means to pull the top N symbols (terminals and non-terminals) off the stack - the RHS of the grammar rule it has now recognized - and replace them with that rule's LHS non-terminal.
e.g.
If there was only one grammar rule:
A -> x y z
Then when the parser saw "x" it would SHIFT, then "y" => SHIFT, then "z" => SHIFT, then (with next symbol end-of-file) REDUCE - popping (x, y, z) off the stack and replacing them with A. This REDUCE action is the first time it knew that those x, y, z were part of A. In a real grammar with more than one rule there may be multiple REDUCE actions (with different RHSs) depending on what followed x [y [z]].
In a top-down (recursive descent) parser you know what you're parsing at any point, and then go looking for what you hope to be there. e.g. if you see "if" then you will call ParseIfStatement which will call ParseExpression etc.
In a bottom-up parser you're discovering what you're parsing by SHIFTing these bottom level (then progressively higher) symbols until finally you've seen enough to recognize what the construct is and can do a REDUCE action using that rule.
So, using the above example, if after "x" and "y" you see something unexpected like "w" rather than "z" you can't generate a top-down error like "Invalid A", because you don't yet know this is part of an "A" (and indeed in a more complex grammar there may be multiple things that "x y" could be part of). So, in the bottom up parser all you can generate as an error is something like "z expected" given that you know "z" would be a valid next symbol (leading to a SHIFT action in this case).
The stack contains states, not symbols. A shifted state encodes the shifted terminal PLUS the information about which nonterminals that terminal can be part of (in LR(1), each of those nonterminals is augmented with its accepted follow-set).
Really the only place where the above breaks down is when, in the middle of a rule, after shifting a terminal (or "shifting" a nonterminal after a reduction; the literature calls this a "goto", but it really behaves like shifting a nonterminal), all that context information is dropped under the assumption that it is not needed for the current rule; it is still available deeper on the stack and will resurface after reducing the current rule. At that point, an error message would be missing this information.
There's been a movement away from LR parsing in favor of recursive descent, and it isn't only that many languages are not LR(1) (for example, C++ can require infinite lookahead). It's also that even if the grammar of the language is LR(1), the errors that users will make often don't fit cleanly into the Yacc/bison structure, so error messages will often be poor quality and improving them can feel like unstructured hacking. It's much easier to produce high-quality error messages with a recursive descent parser. Recursive descent is just closer to how people think, so language designers generally create languages that work well with it.
Now, if LR parsing were a simple topic, it would be OK for compiler classes to spend time on it. But it's complex enough that teaching it adequately takes a lot of time, and that time could be better spent on any number of other topics. Get the students to the point where they have a clean abstract syntax tree quickly and go from there.
The author of the article links to a paper which presents a better error recovery method for LR grammars, which the authors implemented in a Rust library. Does that help with the error messages much?
gcc's C and C++ parsers are handwritten recursive descent parsers. Years ago, Bison was used, and for C++ a third pass was required between the lexer and the parser to add the needed lookahead to differentiate complex type and function declarations. I didn't say anything about LL(1). The key thing here is that high quality compilers are doing it this way, eschewing parser generation tools.
I thought that GCC's handwritten recursive descent parser put it at a huge disadvantage compared to other compilers when it came to implementing new standards (I remember people complaining about it a lot in the past).
You're misinformed. clang also has a custom parser. Both gcc and clang have great support for new standards, though as those standards are evolving some features aren't there yet (but none of the missing features have anything to do with parsing difficulties). You can find lots of detail on feature coverage at https://gcc.gnu.org .
I had one job where my ability to write recursive descent parsers made me a “100x ninja rockstar” because they had a lot of parsing problems that no one recognized as such. I expressly did not use a parser generator because of the complexity to the build, the unfamiliarity of the rest of the team, and I considered them to be overkill for the domain.
I agree with 90%-95% of the article. I would phrase it more pithily as:
1. Use recursive descent for recognizing existing languages, where you have a good test corpus. (And very good point about testing INVALID inputs too!)
2. Use grammars to design new languages
I follow this with https://www.oilshell.org/ -- the compatible OSH is a huge recursive descent parser, and Oil is designed with a grammar.
Guy Steele and Russ Cox also mentioned this methodology -- design the language with a grammar, and then implement it a second time by hand. Then you REALLY know it! :)
It is definitely possible to not know the language you're implementing / creating!
---
As an undergrad, I never tried to design a language, or even thought about it honestly. But I agree that undergrads should know of both approaches, without necessarily getting into the details of CFGs.
It would be a mistake to just teach recursive descent.
Here are my thoughts as to what should be covered with regards to parsing in a compiler course context:
* The lexing/parsing split, and that lexing is basically applying regexes over and over again. (The technique of using a regular expression engine to write your own lexer is a fun idea, but probably should be cut for time).
* What "context-free grammar" means, and the more-or-less standard syntax for writing CFGs.
* Explain how to build a parser with recursive descent as the example. But...
* Demonstrate expression parsing that supports precedence and both left- and right-associativity. We know how to do this very easily algorithmically (there's several different methods, in fact), it's very natural to describe in this manner, and it's so goddamn common that languages have precedence. (As a side note, expressing precedence via AddExpr/MulExpr in grammar should be covered here, since it's commonly written as such in grammars, but if I were ever to build a parser generator, I'd support a syntax that didn't require manual factoring like that).
* Introduce LL and LR parsing in theory. Do not introduce first/follow sets, or the state machine construction for an LR parser. Instead, the points to cover are what makes the languages hard to parse (e.g., ambiguous). Strict adherence to LL/LR grammar isn't necessary and sometimes counterproductive (left-factoring LL grammars is the case in point here).
* Do cover dangling-else ambiguity at some point. It's a great example of something that wasn't discovered until people started applying theoretical tools to existing languages (in this case, ALGOL).
Like the author, I think there is some merit to LR parsing as a theoretical concept, but the emphasis placed on it in existing compiler courses and textbooks is way too high. The end goal of parsing theory, for most students, should be to understand how to build languages that admit unambiguous parsers that require no backtracking, and LR parsers cover a very large subset of those languages. It is also helpful to get people to think about lexing and parsing in more formal terms--I have definitely seen a few people go "oh, that makes things much easier!" when they see the divide between lexing and parsing.
If you mean a parser for a notation expressing regular grammars, then it only takes a couple of hours to knock out a recursive descent parser (if you know what you're doing).
I mean implementable by your standard regex library in your favorite programming language, which is largely a PCRE-style regular expression (although backreferences and lookahead/lookbehind aren't necessary).
Recursive descent parsers have trouble with left-associative operators if you only allow recursion. But if you allow iteration, you can build up the parse tree while iterating through terms at the same precedence level. For example, this defines + and - as left-associative at the same precedence level:
// EBNF is SUM = PRODUCT , { ("+" | "-") , PRODUCT }
    AstNode *parseSum()
    {
        if (auto lhs = parseProduct()) {
            while (1) {
                if (consume(T_PLUS)) {
                    if (auto rhs = parseProduct()) { // You'd call parseSum instead to get right associativity
                        lhs = mkAst(A_add, lhs, rhs); // Instead, left associativity happens here
                    }
                    else {
                        return nullptr;
                    }
                }
                else if (consume(T_MINUS)) {
                    if (auto rhs = parseProduct()) {
                        lhs = mkAst(A_sub, lhs, rhs);
                    }
                    else {
                        return nullptr;
                    }
                }
                else {
                    return lhs;
                }
            }
        }
        else {
            return nullptr;
        }
    }
As a bonus, it doesn't blow out your stack when parsing long expressions.
Very few binary operators should be right-associative. Basically just ?: and = .
There are languages (such as ML) that allow the definition of new operators with a choice of precedence and associativity. Using a table like the one Hanson suggests and modifying it at compile-time is an obvious way to handle that feature.
When you do precedence like this in recursive descent, you will get functions like parse_prec_1, parse_prec_2 and so forth. If you generalize this function body, pass along a "current precedence", and collect the precedences for each operator into a table, you have just reinvented Pratt parsing.
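To make that concrete, here is a minimal, self-contained sketch of the generalized function (all names are illustrative, and AST construction is collapsed into immediate evaluation to keep it short): one parse function parameterized by the current precedence, driven by an operator table.

    // Pratt parsing / precedence climbing over a pre-tokenised expression.
    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Token { char kind; long value; };  // kind: 'n' (number), '+', '-', '*', '/'
    static std::vector<Token> toks;
    static std::size_t pos = 0;

    // One table entry per binary operator; higher binds tighter. Adding an
    // operator is now a table entry, not a new parse_prec_N function.
    static const std::map<char, int> prec = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};

    long parseExpr(int minPrec) {
        long lhs = toks[pos++].value;  // primary: just a number in this toy
        while (pos < toks.size()) {
            auto it = prec.find(toks[pos].kind);
            if (it == prec.end() || it->second < minPrec)
                break;                 // operator binds too loosely for this level
            char op = toks[pos++].kind;
            // Left associativity: the right operand may only contain operators
            // that bind strictly tighter than `op`.
            long rhs = parseExpr(it->second + 1);
            switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                case '/': lhs /= rhs; break;
            }
        }
        return lhs;
    }

    int main() {
        // 2 + 3 * 4 - 5 ==> 9
        toks = {{'n', 2}, {'+', 0}, {'n', 3}, {'*', 0}, {'n', 4}, {'-', 0}, {'n', 5}};
        std::printf("%ld\n", parseExpr(1));
    }

Because the table is ordinary runtime data, it can also be modified while compiling, which is exactly the hook needed for the ML-style user-defined operators mentioned above.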
In his linked Which Parsing Approach article he says this,
“Of course, I’m clearly somewhat biased, so perhaps these words from Guy Steele might be more compelling:
‘[Be] sure that your language will parse. It seems stupid ... to sit down and start designing constructs and not worry about how they will fit together. You can get a language that’s difficult if not impossible to parse, not only for a computer, but for a person. I use Yacc constantly as a check of all my language designs, but I very seldom use Yacc in the implementation. I use it as a tester, to be sure that it’s LR(1) ... because if a language is LR(1) it’s more likely that a person can deal with it.’”
I’m just starting to dabble in writing a compiler and this sounds pretty reasonable to me, basically using Yacc as a linter for your grammar but not for actually generating your parser. What do more experienced people here think of that?
> What do more experienced people here think of that?
"Experienced" person here. I've done a parser generator grammar or two. I've implemented several parsers.
I mostly agree, but it depends on your use case.
Most people writing parsers are doing so for a bespoke language, usually for config of an internal-use-only program or something like that. Such parsers won't be great, but they will work for people that:
1. Know the language, or
2. Are expected to know the language.
In that case, I agree with Guy Steele.
Why these two caveats? Because as people have said, error reporting from the parser will be subpar. This includes hand-written recursive descent ones. So knowing the language is important.
(The second item is specifically for situations where new employees might not know the language yet, but it's part of their job, so they will learn it.)
This is, by far, the greatest number of parsers in use. I would estimate 99.5% or more.
Of course, most people don't think about those parsers; they think about the hardened parsers for general-purpose languages in wide use, but while they get a lot of attention, they are few and far between.
Guy Steele's advice doesn't apply as well to such situations. For me, I wrote a parser for an existing language. I had to write the parser to match that language, and that language had a lot of baggage. The advice made no sense in that case.
For the parser I'm currently writing, it also makes no sense. This is for what I hope to become a widely-used general purpose language. Sure, I could try, but I don't think it would work.
To see why, look at perhaps the two newest big languages: Rust and Zig. Both are not that old (Rust 15 years-ish, Zig 8 years-ish?). And yet, I suspect both use handwritten parsers. I suspect both are recursive descent.
Why is this? These languages shouldn't have that much legacy baggage, like C or C++, right?
It doesn't matter. General-purpose languages have to be everything to all possible users. This means they will have to be able to do some crazy things. And to do crazy things with plain text, well, you need as much power as you can get.
This is why general-purpose languages basically always become context-sensitive. Context-sensitivity allows a recursive descent parser to bring the full power of Turing-completeness to bear on a problem.
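A tiny, hedged illustration of that (hypothetical names; the `(foo)(bar)` case is the ANSI-C ambiguity mentioned elsewhere in this thread): a recursive descent parser can simply consult mutable host-language state, which no context-free formalism allows.

    // Context-sensitivity in recursive descent: the parser consults ordinary
    // host-language state (a symbol table) to decide how to parse.
    #include <set>
    #include <string>

    static std::set<std::string> typedefNames;  // grows as declarations are parsed

    enum NodeKind { N_CAST, N_CALL };

    // For "(foo)(bar)": it's a cast if foo names a type, else a function call.
    NodeKind classifyParenPair(const std::string &foo) {
        return typedefNames.count(foo) ? N_CAST : N_CALL;
    }

    int main() {
        typedefNames.insert("word_t");
        bool ok = classifyParenPair("word_t") == N_CAST
               && classifyParenPair("printf") == N_CALL;
        return ok ? 0 : 1;
    }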
Guy Steele also alludes to this in his classic "Growing a Language" talk. [1] He talks about how you need to create a language that users can grow. There are easy ways to do this (like having functions), but the easy ways are all limited. If there is something that must be a language feature, like format printing [2], it can't be added by users.
Adding those other features usually requires context-sensitivity. Format printing definitely does when it allows you to use variable names in the format string.
But what if you make a language that any user can extend in any way, even if it's not easy? The result is my language, Yao. (Example at [3].) It's context-sensitive because it has to be.
So basically, Guy Steele is right for everybody but the creators of big general-purpose languages. But most people think he's talking about the big general-purpose languages. He might have been, but if he was, he's wrong.
That doesn't mean that I don't consider the ease of parsing when I add a feature. I definitely do. But I come at it from the human perspective, how readable it is to a human. This tends to give me much the same thing while giving me power.
IMHO the majority of engineers not understanding lexing/parsing beyond generators has held back good engineering of incremental parsers/lexers for a long time.
Meanwhile, incremental lexing and incremental recursive descent parsing and LL are actually not that hard (incremental LR is trickier).
It's just that few people really understand how any of it works well enough to implement it.
This is not a "it was not needed" issue. Lexing times of files for syntax highlighting are still "seconds" for larger files, for example.
Meanwhile, I can explain near-optimal incremental lexing to someone who knows how lexers work in a few bullet points, and it's trivial to implement:
1. As your lexer looks at characters, keep track of the maximum lookahead and lookbehind used during a lex.
2. When some range of the string to lex is invalidated, adjust the invalidation bounds to include the maximum lookahead/lookbehind as well.
3. Relex the invalidated portion of the string.
That's it.
For optimality, compute the maximum lookahead/lookbehind on a per-token basis, instead of on a per-lex basis (this is trivial if your editor is already storing the tokens under the text somewhere).
If you store position info in your tokens, you can easily adjust as well.
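A hedged sketch of steps 1-3 (hypothetical names, token rules omitted): the only machinery needed on top of an ordinary lexer is tracking how far it ever peeks, and widening an edited range by those bounds before relexing.

    // Incremental relexing support: track lookahead/lookbehind, widen edits.
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>

    struct LexBounds {
        std::size_t maxLookahead = 0, maxLookbehind = 0;

        // Step 1: route every character inspection through peek(), which
        // records how far from the token start the lexer ever looked.
        char peek(const std::string &text, std::size_t tokenStart, long offset) {
            if (offset >= 0)
                maxLookahead = std::max(maxLookahead, std::size_t(offset));
            else
                maxLookbehind = std::max(maxLookbehind, std::size_t(-offset));
            long i = long(tokenStart) + offset;
            return (i >= 0 && std::size_t(i) < text.size()) ? text[std::size_t(i)] : '\0';
        }

        // Step 2: widen an invalidated range [editStart, editEnd) by the bounds.
        std::pair<std::size_t, std::size_t>
        widen(std::size_t editStart, std::size_t editEnd, std::size_t textLen) const {
            std::size_t lo = editStart > maxLookbehind ? editStart - maxLookbehind : 0;
            std::size_t hi = std::min(textLen, editEnd + maxLookahead);
            return {lo, hi};
        }
    };
    // Step 3 is then just: re-run the ordinary lexer over [lo, hi) and splice
    // the new tokens into the stored token list in place of the old ones. The
    // per-token refinement stores these bounds in each token instead.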
For incremental recursive descent parsing, it's similar - only rules that ever looked at tokens that changed can be affected by the change. You can easily keep the interval of tokens that a given rule looked at.
(If you want truly optimal, you can store the set of intervals looked at, but this is almost always overkill)
If all you ever have is a generator, you will never implement any of the above until your generator does.
I have watched tons of projects spend lots of time optimizing grammars, doing weird caching, etc, when a few simple/better algorithms would have saved them from all of it.
(which is why in that sense, tree-sitter has been a revolution)
This is extremely interesting to me. I looked at the papers tree-sitter referenced but they were all written assuming an LR/generator-based parser and I couldn't figure out how to adapt it to recursive descent.
Let's say I've parsed an AST
binary_expression
lhs: number
rhs: binary_expression
lhs: number
rhs: call_expression
func: identifier
args: call_argument_list
[0]: number
[1]: number
and I've detected that `rhs` on line 3 has changed. So I go back to the start of the node. What do I do then? How do I know I need to parse an expression there and that it belongs to the rhs of a binary expression?
In the literature, this is solved by storing the parse state. How do you do that with recursive descent?
Is there something you'd recommend reading to understand how to do incremental parsing for recursive descent specifically?
> IMHO the majority of engineers not understanding lexing/parsing beyond generators has held back good engineering of incremental parsers/lexers for a long time.
That's a touch unfair. You could easily extend that statement to "The predominant languages of C and C++ have held back good engineering of parsers/lexers." period. This is due to the fact that you effectively have to parse the universe to resolve the symbol table in both languages.
In addition, people didn't need incremental parsing until syntax highlighting became widespread via LSPs. Before that, you were always working at the compile level, so why put in all the work for an incremental parser you were never going to make use of?
People have recently been designing the languages themselves to better support parsing/lexing paradigms that can be incremental and restartable.
This is why most modern languages have converged to having specific syntax to delineate the difference between variable name and variable type, for example.
Rust, for example, has fairly funky things around "unary minus" because of its design for parsing and lexing.
> In practise, a recursive descent parser can be thought of as meaning "a hand-written parser": unless you go out of your way to do otherwise, nearly all hand-written parsers are recursive descent parsers.
I don't think this is true. Yes, almost all hand-written parsers are recursive descent (top-down), but not all recursive descent parsers are handwritten. PEG parsers are top-down, but can be generated very similarly to bottom-up parsers. See for example PyParsing. Packrat parsing seems to me to be the best of both worlds, and there is not much use for bottom-up parsing anymore.
I was quite impressed with GLR parsers, as they can deal really well with ambiguity and are relatively easy to implement. They have limited adoption though as they have high constant overhead and hand-written recursive-descent parsers can easily outcompete them for most languages. The Elkhound paper [1] explains GLR parsing in detail and is quite enjoyable to read.
The point of TFA is: why reach for GLR, why even pay a cost for ambiguity, if you have control over the grammar and can specify an unambiguous one to begin with?
IMNSHO we should be aware of "big hammer" solutions but not be afraid to pick concrete representations when they make things easy on the machine. (in particular, tests using the former are wonderful double checks for deployment candidates using the latter)
[when would specifying an unambiguous grammar be premature optimisation? when changing the language later is cheap (unlikely!), or when the strings in that language are always going to be of size <= k. Nerd snipe: what —to order of magnitude— is k ca. 2023?]
Unambiguous grammars are easy not just to machines. Most importantly, they matter to humans. Reading something like Lua or Golang or SQL is usually much easier than C++, to say nothing of Perl.
Oddly enough my first project as a programmer in 1981 was writing a recursive descent parser in Fortran 78 (for Jovial language). Since I had no degree in CS, I previously had zero clue what a parser even was. Fun way to start 4 decades of programming.
One point though: The author compares context-free grammars with LR grammars on the one hand, and recursive descent (LL) parsers with LR parsers on the other.
But in theory, recursive descent parsers are context-free as well. It's just so damn simple to add context to such a parser. And whenever someone does that, the language design itself suffers.
As an example, consider a language like ML, where you want to distinguish module-level identifiers from value-level ones. In a context-free design, the tokens actually differ (e.g., module identifiers start with capital letters). If you add context to your parser you might think that constraint can be relaxed. But how do you handle implicit imports then?
As a self-taught programmer with no formal education, I figured writing a lexer/parser toolkit would be a good way to bootstrap my CS knowledge (about 20 years ago). I went down the rabbit hole for months on it, really glad I did.
I agree strongly with the point that making parsing a major component of teaching compilers is misguided (as anyone who has read my compiler series would know) - not because learning parsing is unnecessary, but because going more in depth about the rest is sacrificed in favour of what is one of the least interesting parts of the challenge of writing a new language.
There's more to do in improving parsing, but it's still one of the best understood aspects of compiler construction.
That said, I do agree with the article that it's a problem that many syntaxes have complex parser requirements. No grammar I know of requires recursive descent, but many require violations of layering etc. that are not catered for by tooling, which strongly favours manual parsers.
Now, I prefer manually written parsers because I think (having tried and failed to write better options many times) that current parser generation tools are rarely suitable for writing production parsers.
That said, I wish people wrote their grammars for a tool as a validation exercise, to give an incentive to design grammars that are more regular and easier to parse and simpler.
As an example, I love Ruby. Having worked on my Ruby compiler project (though it's been in hibernation for a few years), I utterly hate Ruby's grammar. I don't hate the result, for the most part, but mostly because most of us avoid the more insane things Ruby allows [1].
And while some of the complexity is part of creating what I love about Ruby, the grammar is full of horrible quirks (again [1]), at least some of which might have been reduced if one of the "test cases" of the design process for Ruby was to maintain and update a compact language description for a parser generator tool. Even if that'd not be the production grammar for MRI, using it both to maintain a sane compact grammar description and to validate other parsers against would be great.
I pick on Ruby because I love it and wish it was cleaner and it's a particularly bad offender in this respect, but so many languages have occasional warts that they likely wouldn't have if a formal specification of the grammar made them more obvious.
[1] E.g. 'p(% % )' is valid (it prints "%", because % without a left hand operand reads the following character as the start of a quoted string ending with the same character. 'p % % ' is not, because the "p" can be parsed as an operand, and so the first '%' is interpreted as an infix operator, while the second '%' is interpreted as the start of a quoted character. Fun times. Thankfully I've never seen anyone abuse this in real code.
I enjoyed the article even though I'm in the opposite camp.
> Context-free grammars, and their associated parsing techniques, don't align well with real-world compilers, and thus we should deemphasise CFGs (Context-Free Grammars) and their associated parsing algorithms.
I agree with this. I think CFG are highly overrated. Top down recursive descent parsers are simple and allow you to craft more human languages. I think building top down parsers is something every dev should do. It's a simple technique with tremendous power.
I think the source code for Scroll (https://github.com/breck7/scroll/tree/main/grammar) demonstrates how liberating moving away from CFGs can be. Easy to extend, compose, build new backends, debug, et cetera. Parser, compiler, and interpreter for each node all in one place. Swap nodes around between languages. Great evolutionary characteristics.
I'll stop there (realizing I need to improve the docs and write a blog post).
> And no one will ever be able to implement a tool that works with the input language
But what do you get with such a tool? Syntax highlighting and cryptic error messages on the syntax only. And then you hit a wall: if you want to offer more to your users (refactoring, autocomplete, etc.), you have to reimplement everything anyway.
> CFGs are equivalent to modularity in language design. A choice, clearly, but one with significant upsides.
My point above aside, you bring up a great point, and something I need to improve on a lot in Grammar (modularity so Tree Languages can be run on many host systems). Now my wheels are turning. Thanks!
You can put a breakpoint on a function in a recursive descent parser and see the call stack. You can't put a breakpoint on a grammar production in Yacc and see the parser stack as a call stack in your debugger.
A grammar can be recovered easily only from a very clean Yacc file. A Yacc file can be a hot mess of ad-hoc predecence and associativity overrides which basically requires a specialist with a "Ph. D. in Yacc" to figure out.
GNU Bison (and only that implementation of the Yacc idea) has only recently added the possibility of the grammar having multiple start symbols: e.g. useful if you want to just parse an expression, rather than a file full of definitions.
In recursive descent, you just call whatever function you want, e.g. parse_expr(...).
Ehn, the problem with LR parsing is that it’s terribly slow and can only practically be done via generators, and I have yet to encounter a parser generator that can actually handle error correction properly (necessary if you want anything other than the first error to break everything after).
The theoretical argument for LR parsing is that it is more powerful than LL, LALR, etc., but that's because PLT seems hell-bent on ignoring LL(*) due to the arbitrary restrictions placed on what a "real" BNF grammar is allowed to do.
Knowing about LR - and PDA/state machine based parsing is useful, but pretending that LR parsers are somehow better than recursive descent is nonsense.
> The first is performance. Although there aren't, to the best of my knowledge, modern performance numbers, it is reasonable to assume that recursive descent parsing is generally faster than LR parsing.
Why is this safe to assume? My understanding is that shift-reduce parsers in general are faster than recursive-descent ones because they have a constant overhead per production rule, rather than potentially repeatedly creating a tree and then bailing out to try a different production rule if it doesn’t work.
I’ve tried to learn both of these several times each, and what happens is that each time I end up yelling at my computer, “but we don’t parse/read programs that way!” And I simply cannot participate in this level of fiction so I find another solution or hobby. Neither of them is a good fit for C derived languages and no developer I’ve done DX studies with decomposes code the way either of them do. I absolutely do not.
Last time I looked, the only one I found that’s close to right is SGLR.
What an excellent article that I agree almost entirely with.
I'd like to add nuance in two places:
> If you're writing a parser for a language which is ambiguous, you might be able to use the conflict resolution offered by YACC and friends. However, in general, I suggest avoiding YACC's conflict resolution if possible, as you have to at least somewhat understand the underlying LR algorithm to understand how conflicts have been resolved.
I'd say: Specifically for binary operator precedence and associativity, definitely use conflict resolution of Yacc and friends instead of rewriting the grammar. An ambiguous grammar for binary operators might look like:
E → E ^ E
E → E × E
E → E + E
E → (E)
E → num
Disambiguating in Yacc and friends involves a few lines like:
%left '+'
%left '×'
%right '^'
whereas rewriting the grammar above to be unambiguous is a mess:
E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E₂
EM → E₂ × EM
EM → ε
E₂ → E₃ EP
EP → ^ E₃ EP
EP → ε
E₃ → num
E₃ → (E₀)
Disambiguating by resolving shift/reduce conflicts means fewer states, whereas disambiguating by rewriting the grammar means more states. Perhaps, as Laurence points out in this article wrt. LALR and SLR: This probably doesn't matter at all on modern computers.
Any other shift/reduce or reduce/reduce conflict caused by some other ambiguity than specifically a binary operator, e.g. when you have multiple if-then-else-if-then-else and you want to know where an else belongs: you need to be quite into LR parser generators to understand what's happening; it took quite a lot of guessing and trial-and-error when I worked with them, and I've probably forgotten it again. I'd probably document these as non-trivial choices if I were maintaining a Yacc-like parser, and I'd generally avoid syntax that is ambiguous in this way (although I don't think you can always do that).
> If you need the best possible performance or error recovery, recursive descent parsing is the best choice. If you know you will need this and you're designing a new language, at least create an LL grammar, since from that you can derive a recursive descent parser. [Bear in mind that LL grammars are more annoying to write than LR grammars which is why I suggest avoiding LL unless you have this very specific use case.]
I would say: If you get used to parser combinator libraries, you can get a fast, readable LL parser that is also fast to write and does not require extra compilation steps like practically every LR parser generator. I.e. you can skip the part where you convert your grammar file into a native-$YOUR_LANG parser, because parser combinators are embedded DSLs, some of which are just as fast as their hand-written recursive-descent parser. Parser combinator libraries usually have excellent error reporting capability. (The best example I know is Haskell's diagnose package. [2]) And some of them have backtracking capability, so that for those few parts where your grammar isn't LL, you can go back and try something else.
Modern parser combinator libraries like Haskell's Parsec / Megaparsec have helper functions for generating a parser that effectively translates an operator associativity/precedence table into an LL parser; so you get the nicely readable description, and the transformed LL parser that goes with it; see Megaparsec's makeExprParser [3]:
expr = makeExprParser term table <?> "expression"
term = parens expr <|> integer <?> "term"
table = [ [ prefix "-" negate
, prefix "+" id ]
, [ postfix "++" (+1) ]
, [ binary "*" (*)
, binary "/" div ]
, [ binary "+" (+)
, binary "-" (-) ] ]
binary name f = InfixL (f <$ symbol name)
prefix name f = Prefix (f <$ symbol name)
postfix name f = Postfix (f <$ symbol name)
Encoding expression operator precedence into a grammar (via add-expr, mul-expr, etc.) and then implementing it directly is very inefficient. For an N-level-deep expression grammar, a recursive descent parser will end up making N-deep recursive calls for every terminal (variable, constant)! In a recursive descent parser it's more efficient just to have a single ParseExpression() function that uses the simple "precedence climbing" technique to handle operator precedence, which will result in at most one recursive call per operator (vs N per terminal). You could even eliminate the recursion completely by using an explicit stack instead.
The same inefficiency applies to bottom-up table driven parsers generated by parser generators too - in a naive implementation the parser will have to go through N steps of reduce actions to elevate any terminal up to a top level expression. A smarter generator may eliminate 1:1 "unit productions" of the type add-expr := mul-expr in order to remove this inefficiency. As you say, the other way to handle precedence using a parser generator is operator precedence & associativity ambiguity resolution, but this is a pain to develop.
> require extra compilation steps like practically every LR parser generator.
It is worth noting that the author's lrpar tool (linked in the article) doesn't actually require extra compilation steps or code generation. This perhaps comes at the expense of less flexible lexing and parser actions. But lrpar comes with a tool, 'nimbleparse', which, given a grammar and an input file, parses the input file bypassing code generation. This shortens the grammar editing/testing cycle by quite a lot. Inspired by that, I have been working on an LSP-based version of the tool for a while, to make the whole thing as responsive as possible...
For the Fortran parser in f18 ("LLVM Flang"), I ended up writing a small suite of C++17 constexpr parser combinators to build a recursive descent parser over a well-cooked character stream. A grammar encoded with parser combinators is no less readable than the input to a parser generator, and has the advantages that you can easily automate the construction of a strongly typed parse tree, perform custom error recovery, enable and disable language features, and use custom token recognizers for some of the ancient weirdness that makes Fortran so fun.
One thing that I did this time that turned out to be a good idea: run all source preprocessing and file inclusion first, before doing any parsing; deal with continuation lines and comments, and put the "cooked" character stream that results into a single contiguous block of characters, and use that for parsing. It's really fast, and makes backtracking trivial. Add a data structure that maps every character in the cooked stream back to its source provenance (input file, macro expansion, whatever) and you can get derive good source locations for error messages from just a basic char pointer. (Add a reverse mapping and you've got the basis of an IDE's language server, too.)
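For illustration, here is a minimal sketch of the cooked-stream idea (these are not f18's actual data structures; all names are hypothetical): cooked text accumulates in one contiguous buffer, and a sorted interval map sends any cooked offset back to its origin.

    // Cooked character stream with provenance mapping.
    #include <cstddef>
    #include <iterator>
    #include <map>
    #include <string>
    #include <utility>

    struct Provenance {
        std::string origin;     // input file, macro expansion, whatever
        std::size_t offset;     // position within that origin
    };

    class CookedSource {
        std::string text_;                            // contiguous cooked chars
        std::map<std::size_t, Provenance> intervals_; // cooked start -> origin
    public:
        // Append a run of cooked characters that all came from one place.
        void append(const std::string &chars, Provenance from) {
            intervals_.emplace(text_.size(), std::move(from));
            text_ += chars;
        }
        const std::string &text() const { return text_; }

        // Map a cooked offset (e.g. an error location held as a char pointer
        // into text_) back to its source provenance.
        Provenance locate(std::size_t cooked) const {
            auto it = std::prev(intervals_.upper_bound(cooked));
            Provenance p = it->second;
            p.offset += cooked - it->first;   // offset within the interval
            return p;
        }
    };

A second map in the opposite direction gives the reverse lookup mentioned above.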
TL;DR: Build-time parser combinators are worthy of your consideration for your next parsing task.