I agree - I find your post clearer than the original paper, Crockford's explanation, and Lundh's Python post. It helps that you start with just addition and multiplication, then add more functionality after explaining the basics.
That post is excellent! Somehow I didn't find that when I was looking to learn more about TDOP. The only change I made from yours and Pratt's is that I split out the nud and led parsers completely. It seemed weird to me to lump them together just because they share a token when they aren't semantically related.
For example, the "(" token handles grouping in prefix position and function calls in infix, and those don't have anything to do with each other, so I split out PrefixParselets and InfixParselets into entirely separate dictionaries.
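To make that concrete, here's a rough sketch of the split in Python; the parser helpers (parse_expression, expect, match) and the precedence number are made-up names for illustration, not code from either post:

    PREFIX = {}   # token type -> prefix (nud-style) parselet
    INFIX = {}    # token type -> infix (led-style) parselet plus its precedence

    def parse_group(parser, token):
        # "(" in prefix position: a parenthesized sub-expression
        expr = parser.parse_expression(0)
        parser.expect(")")
        return expr

    def parse_call(parser, left, token):
        # "(" in infix position: a call with `left` as the callee
        args = []
        if not parser.match(")"):
            args.append(parser.parse_expression(0))
            while parser.match(","):
                args.append(parser.parse_expression(0))
            parser.expect(")")
        return ("call", left, args)

    PREFIX["("] = parse_group
    INFIX["("] = (parse_call, 80)   # calls bind tightly

The entry for "(" in one table knows nothing about the entry in the other; they share a token and nothing else.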
I love Pratt parsers (or rather, Top Down Operator Precedence parsers). I recently rewrote the IronJS lexer and parser by hand (replacing one generated by ANTLR) using the techniques described in this blog post and saw some pretty hefty performance improvements: http://ironjs.wordpress.com/2011/03/19/new-lexer-and-parser-...
Edit: I also have a generalized version that can take pretty much any input and any output (F#) here: https://github.com/fholm/Vaughan
I believe this algorithm is also known as "precedence climbing", and it's the most common way to deal with operator precedence in a recursive-descent parser.
I'd say that having a single production for each precedence level is the more common technique in recursive descent, but that can be slow because every factor needs to be parsed by recursing through all the precedence levels. This technique avoids that: it can parse a factor straight away before going into the precedence-checking loop.
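For concreteness, the loop looks roughly like this (a generic precedence-climbing sketch in Python; the token handling and precedence table are invented for illustration, not taken from the article):

    PRECEDENCE = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expression(tokens, min_prec=0):
        left = parse_factor(tokens)          # a factor is parsed exactly once
        while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            # recurse with a higher minimum so equal-precedence operators stay left-associative
            right = parse_expression(tokens, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    def parse_factor(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            expr = parse_expression(tokens, 0)
            tokens.pop(0)                    # consume the ")"
            return expr
        return tok                           # number or identifier

    parse_expression(["1", "+", "2", "*", "3"])   # => ('+', '1', ('*', '2', '3'))

Compare that with a grammar that has one production per precedence level, where parsing a lone "3" still has to walk through every level (expression -> term -> factor) even though no operators are present.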
Delphi uses this parsing method for expressions, and it's one of the reasons it's so fast at compiling.
I don't actually know if Delphi is fast or not, I'll take your word on it.
However, I'd be skeptical that the parser is one of the main reasons for its fast compile times. In my experience, parsing is really a tiny amount of compilation time. The VB.NET compiler, for example, uses the same parsing technique, and it is ... "not fast" at compiling. It spends most of its time binding symbols.
Delphi's parser binds symbols and types the tree as it parses. The only thing left to do after parsing is code generation, which is only a couple of passes over the tree.
The parser (or more accurately, the lexer) is the bit of the compiler which ultimately limits its performance, because it needs to see every character of the source. The compiler can never be faster than linear in the length of its input. Metrics on the Delphi compiler actually show that string hashing is the hottest piece of code, and that's already optimized about as far as it can go.
Which means it's fast because of the way it binds symbols, not because of the way it parses infix expressions. One thing to consider is that such a design hurts the usability of the language: it's a lot easier not to have to worry about forward declarations and "module" interfaces. I don't know Delphi, but I imagine what you describe relies on something like the "unit" construct from Turbo Pascal.
There are many things that make it fast - or rather, it's the lack of slow things that makes it fast. Excessive recursion during parsing is one of the things it avoids (and that was what was on-topic here). Your focus on binding is a focus on something that another compiler does particularly slowly. That a different compiler is fast is not due to its being fast at binding specifically; it's due to being fast, or at least not slow, at almost everything.
Delphi does indeed inherit unit syntax from Turbo Pascal. This is both a blessing and a curse; it makes writing well-structured programs easier by enforcing a logical consistency in ordering (programs more or less have to be written in a procedurally decomposed way), but it also makes writing programs with complex interdependencies between parts more awkward.

The unit format is also one of the things that makes the compiler fast (or not slow); relinking a binary unit based on a partial compile is very quick, and as it's not a dumb C-style object format, the compiler can be intelligent about dependencies. Every symbol has a version, a kind of hash, associated with it, computed from its type, definition, etc. When units are relinked after a recompile of a unit, only the symbols need to be looked up and their versions compared; if there is no version mismatch between imports and exports, then dependent units don't need recompilation.
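The version-comparison idea is roughly this (a toy illustration only, not actual Delphi internals):

    import hashlib

    def symbol_version(name, type_sig, definition):
        # the version is a hash of whatever the symbol's meaning depends on
        return hashlib.sha1(f"{name}|{type_sig}|{definition}".encode()).hexdigest()

    def needs_recompile(imported_versions, exported_versions):
        # imported_versions: {symbol: version recorded when the dependent unit was built}
        # exported_versions: {symbol: version in the freshly recompiled unit}
        return any(exported_versions.get(sym) != ver
                   for sym, ver in imported_versions.items())

If nothing a unit imports has changed version, the unit can be relinked without being recompiled.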
Random factoid: the guy who wrote the original source for the current Delphi compiler was also one of the developers of the .NET GC and a CLR performance architect (Peter Sollich).
I've also played around with combinator parsing using F#'s FParsec (a clone of Haskell's Parsec library) and found it very pleasant and powerful to work with.
[sorry for the plug, but I think my tutorial is more comprehensive in terms of explaining how the thing actually works]