I had to write an expression parser recently and found the number of names for essentially the same algorithm very confusing. There's Pratt parsing, precedence climbing, and shunting yard. But really they're all the same algorithm.
Pratt parsing and precedence climbing are the same except one merges left/right associativity precedence into a single precedence table (which makes way more sense). Shunting yard is the same as the others except it's iterative instead of recursive, and it seems like common implementations omit some syntax checking (but there's no reason you have to omit it).
Here's what I ended up with: https://github.com/Timmmm/expr

I had to write that because, somewhat surprisingly, I couldn't find an expression parser library that supports 64-bit integers and bools; almost all of them only do floats. Also, I needed it in C++ - that library was a Rust prototype that I later translated into C++ (it's easier to write in Rust first and then translate).
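To make the "same algorithm, one table" point concrete, here's a minimal sketch of the core loop (in Python, and purely illustrative - it's not code from that library; the operators and binding powers are just placeholders):

# Minimal Pratt / precedence-climbing loop. Associativity is encoded by giving
# each operator a pair of binding powers (left, right): for a left-associative
# operator the right one is higher; for a right-associative operator (like "^"
# here) it's the other way around.
BINDING_POWER = {
    "+": (1, 2), "-": (1, 2),
    "*": (3, 4), "/": (3, 4),
    "^": (6, 5),                 # right-associative
}

def parse_expr(tokens, min_bp=0):
    lhs = tokens.pop(0)          # assume the next token is an atom
    while tokens and tokens[0] in BINDING_POWER:
        op = tokens[0]
        left_bp, right_bp = BINDING_POWER[op]
        if left_bp < min_bp:
            break                # operator binds too weakly; caller handles it
        tokens.pop(0)            # consume the operator
        lhs = (op, lhs, parse_expr(tokens, right_bp))
    return lhs

# parse_expr("1 + 2 * 3 ^ 2 ^ 2".split())
#   == ('+', '1', ('*', '2', ('^', '3', ('^', '2', '2'))))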
Last I used it, even pretty small grammars took forever to build. I had to be very diligent not to let the grammar cpp-file depend on anything else in my codebase.
Ha, nice to see this on HN: this article was pretty helpful to me in understanding the concept a few years back when I was extending my PEG parsing library [1] with a Pratt parser; this mitigates the problem of PEG parsers not allowing left recursion and allows for a much more concise notation of grammars with operator precedence. Thank you, Bob!
I love that way of embedding precedence tables without the "classic" repetitive EBNF pattern.
I will shamelessly steal this for my next parser library. Also, your documentation is outstanding.
(tiny typo - "Pratt" is spelled "Prett" in one sentence on your page)
Pratt or TDOP parsers are my favorite design when coding the parser directly in a host language, with parser combinators a close second (though combinators edge ahead in some host languages).
But when it comes to writing a larger parser with many production rules, I'd rather use a parser generator based around an Earley parser. You can split production rules and organize their post-processing into an AST much more easily... but providing good error messages is still a bit of a hassle.
I haven't found a parser generator that makes it painless to provide good error messages. Writing the parser directly in the host language has always been much easier in that regard.
I've been working on a parser that parses as you type and gives type and syntax errors as soon as possible, but never any false positives. I didn't expect it to be this difficult. I still haven't really grokked it, and any bugfix usually ends in a game of whack-a-mole across the integration tests. Any hints?
One way I've found helpful for giving precise parsing errors without causing excessive false errors is to have the parser auto-complete the missing tokens with placeholder error nodes. E.g.
if (cond1 And ) { ... }
The And operator is missing its right operand, but everything else is valid.
The parser knows the And operator needs two arms. It can complete the missing arm with an error node.
fn parseCondition() {
  let leftNode = parseBoolExpr()
  if (lookaheadToken() is And)
    return parseAnd(leftNode)
  if (lookaheadToken() is Or)
    ...
  return leftNode
}

fn parseAnd(leftNode) {
  consumeToken()                 // consume the And token
  let rightNode = parseBoolExpr()
  if (rightNode != null)
    return AndNode(leftNode, rightNode)
  else
    // right operand is missing; complete the node with a placeholder
    return AndNode(leftNode, ErrorNode("missing operand", location))
}
In this way, the conditional expression is parsed to completion without interruption. Multiple errors can be handled with completion and recorded. At the end it's a matter of traversing the AST to print out all the ErrorNodes.
It depends on the language you're parsing, of course. But the most important thing is to know when your parser is back on the right track. Most of the time the easiest such point (though not necessarily the first possible one) is the next definition statement.
And make sure to _always_ consume (at least) one token (even if just adding it to the error expression in the AST), so there is no possibility of an endless loop.
sklogic told us here that his strategy was to use two parsers: a fast, generated one by default, and a separate one for good error handling. If using that strategy, we might be able to build some error handling in as part of the grammar. The generator ignores it for one engine and uses it for another.
That's good advice... it's especially useful in an Earley design, because the parser is much faster when it doesn't have to keep explicit back references for the derivation, but those references are needed for explaining the current context. Adding a switch for keeping/discarding history (and a compiler pass for whether or not to fold the epsilon-DFA into collapsed states, "NNF" style) would automate the fast vs. slow-but-informative variants. Add in some debug-only output in the postprocessors and it wouldn't really need any syntactic changes to the parser generator. I may try this soon...
A few years back I had to implement a new parser for a custom expression language, replacing an old parser that didn't give very good errors and was hard to extend. I never did the traditional CS track, so parsers were kind of mysterious black magic to me; naturally, the first thing I did was search HN.
I stumbled on an older parsing thread [1] with a link to a toy Pratt implementation [2] made by an HNer… and shamelessly ripped it off to write my own production Pratt parser implementation.
Great success! Writing a Pratt parser by hand is a lot easier than I thought and like the comment says, adding type information was trivial.
I met Vaughan Pratt at a math seminar a few years back, where he was talking about his current interest, Chu spaces. While we were waiting in line to get coffee, with the coffee stations labeled "regular" and "decaf", Pratt said, "But where's the context-free coffee?" and then laughed at his own joke. Later at lunch he was telling us about a physics experiment he set up in his pool shed; "My wife told me to do it in the pool shed so that if it explodes it won't destroy the house."
I recently implemented a Pratt parser for a language prototype. I initially looked at several parser generators, but while they make it easier to write down the grammar of your language, you need to do a lot of work to make them actually spit out working code that integrates well with your project. A Pratt parser can be implemented in a few dozen lines of code in whatever language you choose. You just have to mentally switch from thinking of your grammar in terms of production rules to thinking in terms of operators. For example, "if" becomes a prefix (nud) operator, and for function calls, the opening parenthesis after the function name becomes a binary (led) operator.
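A hedged sketch of that led idea for calls (my own illustration in Python, not the grammar described above; token handling is deliberately simplified and only "(" and "+" are wired up):

# Sketch: the "(" after a name is handled as an infix (led) operator whose
# right-hand side is an argument list.
def parse_expr(tokens, min_bp=0):
    lhs = tokens.pop(0)                  # atom: a number or a name
    while tokens:
        tok = tokens[0]
        if tok == "(" and 10 >= min_bp:  # call "operator", binds tightly
            tokens.pop(0)
            args = []
            while tokens[0] != ")":
                args.append(parse_expr(tokens, 0))
                if tokens[0] == ",":
                    tokens.pop(0)
            tokens.pop(0)                # consume ")"
            lhs = ("call", lhs, args)
        elif tok == "+" and 1 >= min_bp: # an ordinary binary operator
            tokens.pop(0)
            lhs = ("+", lhs, parse_expr(tokens, 2))
        else:
            break
    return lhs

# parse_expr(["f", "(", "x", ",", "y", ")", "+", "1"])
#   == ('+', ('call', 'f', ['x', 'y']), '1')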
One issue I have with the tutorials for writing Pratt parsers is that they usually only describe how to parse simple and very regular expressions. You can still use Pratt parsers to parse much more complicated programming languages, but this requires a bit more work and some modifications to the main parsing function.
I have parsed a whole language using a precedence-based approach before - the first time I wrote a precedence-based parser.
It very easily slips away from you and ends up a maddening mess of extra rules and flags, unless you remember that you can easily "mode-shift" between different methods. (And I keep making that mistake, to be honest, because the conceptual simplicity of a precedence-based parser - be it Pratt or shunting-yard or a variation - is seductive.)
E.g. "escaping" out from a Pratt parser to recursive descent (or anything else) for a sub-expression that is sufficiently messy to make it painful to model with precedence tables is fairly simple. It's worth keeping in mind (for myself too...) if your main parsing function and/or precedence table starts becoming too complex.
It really depends on how easy it is to determine the handover points. If the grammar gives you clear, unambiguous boundaries that don't require much context to determine when to switch, you should be able to switch reasonably easily almost no matter the grammar type.
I agree it's easier to apply this top-down, though - you just call into the other parser and get whatever subexpression back, and you "just" need to have it treat tokens that can't continue an expression as the end of the stream.
E.g. a trivial top-down example would be any language where IF/ELSE is not an expression - you might parse IF (condition), then call into a Pratt parser, treat "ELSE" or "END" as terminating the expression, and just continue.
Bottom-up you can still handle grammars that are unambiguous fairly easily if the delimiting tokens aren't valid in other contexts, but I suspect you're right that you'll get into trouble more easily if the grammars are more complex.
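A sketch of the top-down handover described above (a hypothetical toy grammar with an extra THEN keyword so the expression's end is unambiguous; none of this is from a real parser):

# The statement parser is plain recursive descent; the expression parser is a
# tiny Pratt parser that stops at any token that can't continue an expression.
TERMINATORS = {"THEN", "ELSE", "END"}
BP = {"+": (1, 2), "*": (3, 4)}

def parse_expr(tokens, min_bp=0):
    lhs = tokens.pop(0)                       # atom
    while tokens:
        op = tokens[0]
        if op in TERMINATORS or op not in BP:
            break                             # treat it as "end of stream"
        left_bp, right_bp = BP[op]
        if left_bp < min_bp:
            break
        tokens.pop(0)
        lhs = (op, lhs, parse_expr(tokens, right_bp))
    return lhs

def parse_if(tokens):
    assert tokens.pop(0) == "IF"
    cond = parse_expr(tokens)                 # Pratt part; stops at THEN
    assert tokens.pop(0) == "THEN"
    then_branch = tokens.pop(0)               # pretend a branch is one token
    else_branch = None
    if tokens and tokens[0] == "ELSE":
        tokens.pop(0)
        else_branch = tokens.pop(0)
    assert tokens.pop(0) == "END"
    return ("if", cond, then_branch, else_branch)

# parse_if("IF a + b * c THEN x ELSE y END".split())
#   == ('if', ('+', 'a', ('*', 'b', 'c')), 'x', 'y')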
I found this Rust tutorial[1] on Pratt parsers to be really easy to follow as well. I'm not a Rustacean but I didn't find not knowing Rust to be a barrier. I used that as a guide to write the parser for my experimental programming language Fur[2].
However, I'll caution anyone writing their own programming languages to read some wisdom from someone who has written a production-quality programming language[3]. Most programming language developers get bogged down in writing the parser and never even get into solving the real hard problems of language development, which are things like bytecode generation and garbage collection. The fact is, a naive recursive descent parser with some low lookahead number to handle left recursion will be performant and can be written in an afternoon for 99% of parsable languages. I'm not sure what it is about parsers that makes them susceptible to yak shaving, but it seems to be a trap a lot of people fall into (including myself!).
I don't know. This is becoming conventional wisdom but I think a reverse argument is equally true: aspiring language designers get too bogged down in bytecode generation or gc that they are unable to make a compelling language that adequately differentiates itself from existing languages that already do those things well.
Are you actually claiming that handwriting a recursive descent parser for most popular languages can be done in afternoon? Unless the language is a lisp, I find that unlikely in most cases. You can write a parser generator that generates a recursive descent parser. Writing the grammar will almost surely be significantly faster in the parser generator dsl than a handrolled recursive descent parser.
To improve your language design skills, you need to design multiple languages. Parser generators will get you a prototype faster and allow you to explore the language design space. You can target a high level language to do the hard parts. Once you are satisfied with a language, then maybe it makes sense to design your own custom bytecode or gc to solve its unique problems.
I do agree though that optimizing parsing at an early stage makes no sense and many do get bogged down there.
> Are you actually claiming that handwriting a recursive descent parser for most popular languages can be done in afternoon? Unless the language is a lisp, I find that unlikely in most cases.
Yes. I've written enough parsers that I can say that with a lot of confidence, and I'm not sure where your doubt is coming from. The parser for Fur was written in about 6 hours and probably 4 hours of that was tracking down pathological semicolon elision edge cases (0/10, would not implement semicolon elision in future languages). I'm sure that Nystrom has written many more parsers than I have.
The edge cases might be tricky to iron out for existing, mature languages, but when you're writing your own language I'd shy away from features which require crazy parser shenanigans like JavaScript RegEx literals or C++ templates, at least until you're far along enough to have justification for those kinds of features.
This is also the sort of thing that I imagine ChatGPT, even the free version, could do in even less time.
There's one caveat here, which is that error reporting is one genuinely hard part of parsing. But if you understood that, you wouldn't be recommending parser generators; poor error reporting is exactly why many production-quality languages have hand-rolled parsers.
> Writing the grammar will almost surely be significantly faster in the parser generator dsl than a handrolled recursive descent parser.
I'm not interested in hearing speculation on things I've benchmarked. When it comes to performance, benchmark or GTFO.
Performance isn't significantly different in most cases, and in pathological worst cases, it's often easily solved by memoizing (i.e. converting the recursive descent parser into a Packrat Parser). You shouldn't even memoize everything in most cases, because a lot of sub-parsers won't trigger backtracking, so there's no need to waste memory on memoization.
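For what it's worth, a sketch of that selective-memoization idea (a hypothetical helper, assuming each sub-parser takes a token list and a position and returns a (node, new_position) pair, or None on failure):

import functools

def memoize_rule(rule):
    # Cache results per input position so a backtracking caller that retries
    # this rule at the same spot pays for it only once. In a real parser the
    # cache would be reset (or keyed) per parse run.
    cache = {}
    @functools.wraps(rule)
    def wrapper(tokens, pos):
        if pos not in cache:
            cache[pos] = rule(tokens, pos)
        return cache[pos]
    return wrapper

@memoize_rule
def parse_type_annotation(tokens, pos):
    # A made-up rule that backtracking callers might re-enter repeatedly.
    if pos + 1 < len(tokens) and tokens[pos] == ":":
        return ("type", tokens[pos + 1]), pos + 2
    return None

Only rules that actually get retried by backtracking callers need the decorator; everything else can stay unmemoized.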
> To improve your language design skills, you need to design multiple languages. Parser generators will get you a prototype faster and allow you to explore the language design space.
To design a language, I don't need to write a parser at all in most cases. I certainly don't need to prototype syntax to design it effectively. Writing programs in your target syntax, even if you can't actually run them, is far more effective.
Related to the fact that parsers aren't the hard part, syntax isn't the important part, so if you get bogged down in designing the syntax that's also a lose. I genuinely think most novel syntaxes create more usability/entry barriers for users than the benefits they offer. Go with the syntax of an existing popular language if you at all can--the simpler/more popular, the better. Semantics, idioms, and ecosystem are all much more important in what makes a language useful.
> You can target a high level language to do the hard parts.
That's exactly why this approach is terrible. The hard parts are the entire point of creating a new language. If the hard parts are all implemented in a different language then people should just use that language instead of yours.
> Once you are satisfied with a language, then maybe it makes sense to design your own custom bytecode or gc to solve its unique problems.
Still no--I don't think you understood what I meant when I said "the real hard problems of language development, which are things like bytecode generation and garbage collection".
If you're targeting an existing VM such as the JVM or CLR you won't need to implement custom bytecode or GC, and even if you decide to implement your own bytecode or GC, you may be able to use an off-the-shelf implementation. You'll note that my post said bytecode generation, which is hard no matter what syntax you're implementing or bytecode you're targeting, because it's translating the semantics of the syntax to semantics that the target architecture can understand. When I said implementing garbage collection was hard, what I meant was not that implementing the algorithms is hard (you can literally plug in an off-the-shelf garbage collector in most cases) but that calling the garbage collection algorithms at the right times is hard.
> I'd shy away from features which require crazy parser shenanigans like[...] RegEx literals
Note to aspiring language designers: you don't have to shy away; you can 80/20 your way there even in extremely early versions of your language if you rule that, for now, all regular expression literals must be parenthesized (so just tell authors to write e.g. `(/re/)` rather than `/re/`). With this requirement in place, it's about as easy to support them in the lexer as it is to support other multi-character operators like `&&` or `>=`.
I call this trick the "eggex hack"—at least I would, except that chubot also chose the "eggex" terminology in Oil shell to describe a completely different concept for reasons that I don't understand.
Interesting idea. To check my understanding, I think what you're saying is the following:
The scanner has a "(/" token which triggers dropping into a different scanner/parser for RegExes. This secondary scanner/parser drops out upon seeing a "/)". Alternatively, the "(/" and "/)" could be replaced by a trivial one-token lookahead (which would allow "( /" to match, for example); the key here is to differentiate "(/" from "(" or "/" so that "(/" immediately tells you that you're in a RegEx. This obviates the need to differentiate arbitrary RegExes from other things which might occur after a division symbol "/" like in JavaScript.
If that's what you're saying, seems brilliantly simple! I don't love the user experience of that, but maybe that's just an aesthetic opinion. Knowing how difficult it is to implement JavaScript-style RegEx literals, this seems like a good idea, at least at first.
You get the gist of the benefits—to get to a point where you can "implement JavaScript-style RegEx literals" without too much effort by removing some barriers to implementation—but some of the conceptual specifics in your understanding are off.
> I think what you're saying is the following:¶ The scanner has a "(/" token which triggers dropping into a different scanner/parser for RegExes
Not quite. First, you don't have separate `(/` and `/)` tokens. The `(/` character sequence is the start of a single pattern literal (the same way that if you're parsing JSON, the opening double quote is part of the string literal). What it does is serve as an unambiguous symbol that yes, what we are seeing here is a literal for a PCRE-style pattern. Secondly, you don't have to have a separate scanner. That's sort of the point—supporting "eggular expressions" is easy because tokenizing them is drastically simpler than the hacks regarding context-sensitivity that you must weave into both your parser and scanner (and complicate the scanner–parser contract) if you want conventional unparenthesized literals. Adding support for the parenthesized form to your lexer amounts to just a few lines of code, and not only that, but very simple, very boring code of the same sort that you have elsewhere—easier than the code to support numeric literals, even—whereas the same is distinctly not true if they are allowed to be unparenthesized.
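For illustration, a sketch of how small that lexer case can be (a hypothetical lexer, not the parent's; it just grabs everything between "(/" and "/)" as one token and allows backslash escapes):

def lex_regex_literal(src, i):
    # Precondition: src[i:i+2] == "(/". Scan forward to the matching "/)".
    j = i + 2
    while j + 1 < len(src):
        if src[j] == "\\":                    # skip an escaped character
            j += 2
            continue
        if src[j] == "/" and src[j + 1] == ")":
            return ("REGEX", src[i + 2 : j]), j + 2   # token and next position
        j += 1
    raise SyntaxError("unterminated regex literal at offset %d" % i)

# In the lexer's main loop this is just one more two-character case, like "&&":
#   if src[i : i + 2] == "(/":
#       token, i = lex_regex_literal(src, i)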
(NB: I mentioned 80/20 before. These "eggexes" are also the good kind of partial implementation: a pure case of forward compatibility—if you support the parenthesized form now, there's nothing stopping you from relaxing the paren constraint in the future and announcing that you now support unparenthesized literals.)
> I don't love the user experience of that, but maybe that's just an aesthetic opinion.
Yeah, aesthetic is right. I actually prefer them parenthesized, anyway, for entirely unrelated reasons. Similar constructions like "foo".bar and [].baz, while technically permitted in the grammars of several languages, raise my hackles. It's no less true concerning literals used for pattern matching. Parens need to be inserted on the lhs of the dot for anyone who wants my approval in code review. As it happens, though, there are benefits to readability on multiple levels (i.e. for both humans and machines).
> Secondly, you don't have to have a separate scanner. That's sort of the point—supporting "eggular expressions" is easy because tokenizing them is drastically simpler than the hacks regarding context-sensitivity that you must weave into both your parser and scanner (and complicate the scanner–parser contract) if you want conventional unparenthesized literals.
Hm, I'm not sure I understand this bit.
It seems bad to use the same scanner for the tokens inside your regular expression as for the rest of your language. Two examples:
1. In my language, "3.14" is a token, but when parsing an eggex you'd want that to be at least 3 tokens because the "." is typically your "match one of any character" token.
2. My language doesn't have a "?" operator, but if we're using the same scanner for eggexes as the rest of the language, I'd have to include a token for that to support "zero or one instances" in the eggexes.
Sure, your main scanner can (and probably should) just grab everything between "(/" and "/)" as one token, but there's no avoiding going over the contents of the eggex with a scanner. I'm not saying you have to scan it during the regular scanning--that's probably not a good idea--but you're going to have to scan it at some point in order to compile the eggex.
"I’m going to make a claim here that will be unpopular with some compiler and language people. It’s OK if you don’t agree. Personally, I learn more from strongly stated opinions that I disagree with than I do from several pages of qualifiers and equivocation. My claim is that parsing doesn’t matter."
Parsing matters for tooling, and that's about it. The recent Go retrospective on this goes into this somewhat, but if you have a reusable parser then you or others can easily create linters, formatters, and static analysis tools for your language, which really fleshes out the ecosystem.
These are certainly useful, but there's the major caveat that the "easy" stops when you have anything but the simplest precedence rules. For example, if unary `NOT` has a weaker precedence than other unary operators, or if `a < b < c` is to be treated differently than `(a < b) < c`, or if you want to forbid bitwise operators mixing with relationals to avoid confusion from fact that they have a different precedence in C (for historical reasons) than in many other languages.
Note also that Bison with precedence tags is more efficient than writing out the same grammar without precedence. And `bison --xml` lets you run the (trivial) machine yourself so you can do it in any language.
> anything but the simplest precedence rules. For example, if unary `NOT` has a weaker precedence than other unary operators
Unary operators use precedence levels (binding powers) too; they are no different (regarding the existence of precedence levels) from infix functions. This is just a simplification of the precedence table in the linked article (which I haven't read).
> or if `a < b < c` is to be treated differently than `(a < b) < c`
If you want "<" to be right associative, that's exactly one of the things that's easy to do with a Pratt parser, change the binding power on one side for the "<".
> if you want to forbid bitwise operators mixing with relationals
I don't know how using a Pratt parser would interfere with throwing an error?
> If you want "<" to be right associative, that's exactly one of the things that's easy to do with a Pratt parser, change the binding power on one side for the "<".
It's not about associativity at all. "a < b < c" is a ternary operator that compares three operands.
OP is saying that you can write (a < b < c) in C and the likes, but it behaves as if ((a < b) < c), which is pretty much never what you want - hence why ternary and N-ary comparison operators like the Python one are desirable in the first place.
Semantic actions plus a specific node for grouping (parentheses) make handling your examples easy (e.g. for your a < b < c example, you just need to be able to distinguish it from the grouped expression). You can do it by adding various flags to change the parser behavior too, but that certainly turns into a mess easily.
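A sketch of what that looks like in practice (the node shapes are made up; the only point is that the parenthesized form carries a "group" node that the comparison builder can see):

# Build comparisons so that a bare "a < b < c" chains into one n-ary node,
# while "(a < b) < c" stays a two-operand comparison of a group node.
def make_comparison(op, lhs, rhs):
    if isinstance(lhs, tuple) and lhs[0] == "compare":
        return ("compare", lhs[1] + [op], lhs[2] + [rhs])
    return ("compare", [op], [lhs, rhs])

# a < b < c:
#   make_comparison("<", make_comparison("<", "a", "b"), "c")
#     == ('compare', ['<', '<'], ['a', 'b', 'c'])
# (a < b) < c:
#   make_comparison("<", ("group", make_comparison("<", "a", "b")), "c")
#     == ('compare', ['<'], [('group', ...), 'c'])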
An interesting variant of Pratt which I haven't seen elsewhere, and should have written up, is to allow application as juxtaposition, i.e. shell command-like syntax rather than FORTRAN-like. I don't remember the code, but it was a simple mod. Long ago, that plus (inefficient) combinators straight out of Burge gave me a quick prototype of an alternative Scheme reader intended for interactive use inspired by Haskell syntax. S-expressions could still be used as a natural part of the syntax. I guess there's something like that for Racket now.
You can generalize that into handling operators of various arity both before and after the operator.
E.g. I once had a parser that would handle "x op y z". I don't recall why - it was 30+ years ago, but I can back up that it's a fairly simple modification to expand precedence based parsers this way.
This is always a nice read. I was going through Bob's interpreter book and I wanted a more complete or interesting parser. This is a very nice addition to that.
I wrote a parser library based on (slightly extended) Pratt parsing. It's a nice, different location in the design space of parsers. Very declarative, and it allows having great error messages out of the box.
Some time ago I went down a fascinating rabbit hole on Pratt parsers while writing a math expression evaluator. In addition to the OP and above-linked article, I found the following helpful.
It's commonplace, and one of a few very closely related algorithms (Pratt/TDOP, precedence climbing, shunting yard, etc.) that I guess you can call "traditional", but outside of people who regularly write parsers it's still not widely known.
Not least, I suspect, because while the pattern is obvious once you know it, a lot of people will implement their parser by looking at an (E)BNF representation that doesn't make it explicit. E.g. this is a small subset of Oberon-07's grammar:
expression = SimpleExpression [relation SimpleExpression].
relation = "=" | "#" | "<" | "<=" | ">" | ">=" | IN | IS.
SimpleExpression = ["+"|"−"] term {AddOperator term}.
AddOperator = "+" | "−" | OR.
term = factor {MulOperator factor}.
MulOperator = "*" | "/" | DIV | MOD | "&" .
When you know any of the precedence parsing techniques and how to map a grammar to them, it's instantly obvious that this is modeling precedence rules, and it's trivial to turn it straight into a table. But if you've never written a parser before, aren't familiar with the theory, and have only just figured out how to read EBNF, it's a lot easier to stumble your way to a working recursive descent parser than to figure out how to generalize the grammar fragment above into reusable functions and a precedence table.
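For the "just figured out EBNF" reader, here's roughly how that fragment collapses into a single table (my reading of it; the numbers are arbitrary, only their order matters):

# One binding-power level per grammar rule; all of these are left-associative.
PRECEDENCE = {
    "=": 1, "#": 1, "<": 1, "<=": 1, ">": 1, ">=": 1, "IN": 1, "IS": 1,  # relation
    "+": 2, "-": 2, "OR": 2,                                             # AddOperator
    "*": 3, "/": 3, "DIV": 3, "MOD": 3, "&": 3,                          # MulOperator
}
# The optional leading "+"/"-" in SimpleExpression becomes a prefix (nud)
# operator, and the fact that [relation SimpleExpression] may appear at most
# once is a separate check, not something the table expresses.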