Parsing Techniques - A Practical Guide

10ren · on Aug 31, 2009

The table of contents doesn't seem to mention LL(*) grammars, which are used by ANTLR. I think this is the same thing as a "pack rat" parser (which from memory has a very straightforward implementation, but isn't an academic concept).

I've been writing parsers for a few years now, but I have found the academic work on it to be actively misleading - at least, for the specific things that I want to do. I'm including studying it as an undergraduate, teaching it to undergraduates, and studying it as a PhD candidate.

I've actually read papers where it appears the authors have been misled in precisely the way I had been. Or it could just be that I'm doing something quite strange and weird (i.e. novel and non-obvious).

haberman · on Aug 31, 2009

> I think this is the same thing as a "pack rat" parser (which from memory has a very straightforward implementation, but isn't an academic concept).

Also, packrat parsing was popularized in the academic literature by Bryan Ford's Master's Thesis "Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking". It is best known in the context of Parsing Expression Grammars (PEG).

haberman · on Aug 31, 2009

> The table of contents doesn't seem to mention LL(* ) grammars, which are used by ANTLR.

It's true, this book mentions LL(* ) only in passing. There is a short section (8.3.3), which concludes by saying:

"LL-regular is probably of theoretical interest only; if a parser writer goes through that much trouble, the effort is more wisely spent on LR-regular. Still, it offers many interesting insights, and the parser landscape would be incomplete without it."

Notice that they call it "LL-regular" instead of LL(* ); after reading the somewhat scant literature about LL-regular, I have become convinced that LL(* ) is one in the same as LL-regular, which saw some research in the late seventies. I have been in contact with Terence Parr (author of ANTLR) to see if he comes to the same conclusion. I sent him the LL-regular papers (some of which he had not seen before), and have not heard back from him since.

I'm also working on a parsing project that uses LL(* )/LL-regular (http://www.gazelle-parser.org). It's too bad the technique has not received more attention, because it can be quite powerful.

> I think this is the same thing as a "pack rat" parser (which from memory has a very straightforward implementation, but isn't an academic concept).

No, packrat parsers are not the same as LL(* )/LL-regular. They are both top-down methods, but LL(* )/LL-regular is deterministic, meaning you build a table ahead of time that tells you what decision to make when you're faced with a choice. Packrat parsing, on the other hand, is based on memoized search.

Packrat parsing is linear time, linear space. LL(* )/LL-regular is linear time, constant space [0]. While both are linear time, LL(* )/LL-regular is faster. On the other hand, packrat parsers can handle all non-left-recursive grammars, while LL(* )/LL-regular can only handle a subset of such grammars.

[0] Technically the time could degrade to n^2, but no real language would trigger this. Also, the space is technically proportional to the nesting depth, but this is typically omitted.

> I've been writing parsers for a few years now, but I have found the academic work on it to be actively misleading - at least, for the specific things that I want to do.

Really? What do you find misleading? I've been reading more and more parsing literature, and while it can tend toward the impenetrable and mathematical, I've never found it misleading. And I quite like the book mentioned in this article -- if you don't own it already you should definitely get a copy if you're active in the field.

> Or it could just be that I'm doing something quite strange and weird (i.e. novel and non-obvious).

Out of curiosity, what are you doing? And is your PhD focused on parsing? I'm always looking to meet people who are active in the field.

10ren · on Sept 1, 2009

I found some aspects misleading, when applied to some specific problems. It's not in the reading or understanding (though that can be hard!), but in the applying.

But on reflection, this is very often the case with theoretical approaches: they can be misleading when you apply them to some specific problems. By "apply", I include the conceptualization of the problem, not just implementation.

Thanks for asking, but unfortunately I can't talk about the work at the moment, except to say it's more application than theory. I shouldn't have mentioned it when I'm not in a position to disclose, but I didn't realize the extent of the connection when I commented. Sorry - I feel really bad about this.

doty · on Aug 31, 2009

Aren't both packrat and LL(* ) deterministic?

I thought the difference was that LL(* ) uses look-ahead to determine which branch of an alternation to choose, where packrat uses backtracking, and simply tries the branches of the alternation in order.

haberman · on Aug 31, 2009

Your description of these two algorithms sounds accurate; I think your point of confusion is with the term "deterministic."

The fact that packrat parsing is based on search-and-backtrack by definition means it is not deterministic. Deterministic parsers never backtrack, because they are designed have enough information to always make the correct choice, by using lookahead.

One important distinction between the two is that a parser based on search returns the first parse tree it finds, but that might not be the only possible parse tree. If you have a well-defined order for the alternatives, then it will always return the same parse tree, but the grammar could still be ambiguous and capable of returning other parse trees also. PEGs sidestep this issue by defining away this ambiguity, and saying that the first parse tree is the ONLY parse tree -- this is the difference between PEGs and context-free grammars.

With deterministic parsers, on the other hand, the process they use to build their tables ensures that the grammar is not ambiguous. So when you get a parse tree back from a deterministic parser, you know it's the only possible parse tree for that input text.

doty · on Aug 31, 2009

Interesting, thank you for clearing that up for me.

Just to make sure I understand: you could view a PEG as having an ambiguity that is resolved by implementation.

It's a small and subtle difference that both LL(* ) and PEG grammars are by definition unambiguous and produce only one tree; they just accomplish it in different ways.

haberman · on Aug 31, 2009

> Just to make sure I understand: you could view a PEG as having an ambiguity that is resolved by implementation.

I wouldn't quite put it that way, because PEGs are defined abstractly, not in terms of any particular parsing method. For example, lpeg (http://www.inf.puc-rio.br/~roberto/docs/peg.pdf) parses PEGs also, but does not use packrat parsing to do so.

> It's a small and subtle difference that both LL(* ) and PEG grammars are by definition unambiguous and produce only one tree; they just accomplish it in different ways.

Yes, though I will editorialize a bit here and say that PEG's method of defining away ambiguity makes it impossible to know when there are true ambiguities in your grammar. With PEGs, if you have:

a -> b / c;

...it is undecidable whether this is the same as:

a -> c / b;

Take a common ambiguity like the if-then-else problem (http://en.wikipedia.org/wiki/Dangling_else). This is a real ambiguity that will affect how programs in your language are interpreted. But a PEG based tool will never tell you that an ambiguity exists, because to a PEG there is no ambiguity. It gives you no help in understanding whether any of your language's productions are ambiguous to a user.

doty · on Aug 31, 2009

> I wouldn't quite put it that way, because PEGs are defined abstractly, not in terms of any particular parsing method.

I thought one of the core things in PEGs was ordered choice, which implies execution order. Packrat parsing is just a space vs. speed tradeoff, converting super-linear time (thanks to backtracking) into linear time with linear space (thanks to memoization). Are there PEGs that have unordered choice? (lpeg doesn't)

> It gives you no help in understanding whether any of your language's productions are ambiguous to a user.

Point taken.

haberman · on Aug 31, 2009

> Are there PEGs that have unordered choice? (lpeg doesn't)

No, PEGs are defined in terms of ordered choice. All I'm saying is that they are not defined in terms of one particular algorithm that implements ordered choice. That's why I objected to the characterization that PEGs have their ambiguity resolved "by implementation." That's inaccurate; their ambiguity is resolved by definition, not by implementation. PEGs are unambiguous even if there is no parsing implementation.

bkovitz · on Aug 31, 2009

What is a specific way in which you were misled? (Maybe you can save me from the same fate.)

10ren · on Sept 1, 2009

It was to do with applying a particular aspect of the theory to a particular problem. I guess the lesson is the old one that theories can't always be usefully applied to every problem in practice, even when the theories are perfectly true. "There is no difference between theory and practice - in theory."

I'm sorry, I (mysteriously) can't disclose specifics. Actually, I think it would be of interest to vanishingly few people, and as I explained it, even the remaining ones would quietly slip away.

andreyf · on Aug 31, 2009

Editors, please change link to new edition: http://www.cs.vu.nl/~dick/PT2Ed.html

Have been self-studying from this, and it's a great book. Highly recommended.

bkovitz · on Aug 31, 2009

Is the second edition available for download?

andreyf · on Sept 1, 2009

Ah, not for free, no. Good call :)

zaidf · on Aug 31, 2009

My favorite parsing function: substr()