1. Context Free Grammars (essentially regular expressions that can be recursive)
2. There are two main families of parsing techniques: LL and LR. LR is much harder to implement; LL is straightforward recursive descent. You don't actually have to study LR parsing.
3. Tools like yacc (an LALR parser generator; LALR is a weaker but more efficient form of LR) can generate parser code in C or your favorite language. However, parser combinator libraries are often preferred over code generators. These libraries are typically easiest to write in functional languages, but they tend to be LL parsers that don't handle left recursion well.
4. This paper proposes a technique that lets you use left recursion in a parser combinator library. So, the only reason you would now use yacc is parsing speed.
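To make the LL/left-recursion point in the list above concrete, here is a minimal recursive-descent (LL) parser sketch in Python. The names (`parse_expr`, `parse_term`) are my own, not from the paper. The left-recursive rule `expr -> expr '+' term` would make a naive recursive-descent `parse_expr` call itself forever before consuming any input; the standard workaround is to rewrite it as `expr -> term ('+' term)*`, which is what this sketch does:

```python
# A minimal LL recursive-descent parser for "1+2+3"-style sums of digits.
# The grammar  expr -> expr '+' term  is left-recursive: implementing it
# naively would make parse_expr() recurse forever without consuming input.
# The standard rewrite is  expr -> term ('+' term)* , implemented below
# with a loop instead of left recursion.

def parse_expr(s, i=0):
    """Parse a sum of digits starting at s[i]; return (value, next_index)."""
    value, i = parse_term(s, i)
    while i < len(s) and s[i] == '+':
        rhs, i = parse_term(s, i + 1)
        value += rhs
    return value, i

def parse_term(s, i):
    """Parse a single digit at s[i]; return (value, next_index)."""
    if i < len(s) and s[i].isdigit():
        return int(s[i]), i + 1
    raise SyntaxError("expected digit at position %d" % i)
```

The paper's contribution is, roughly, letting you keep the natural left-recursive grammar instead of doing this rewrite by hand.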
> So, the only reason you would now use yacc is parsing speed.
"speed" means a couple different things. Someone might say the only reason you would use C++ rather than Python is speed -- i.e. you can always get within a constant factor with Python. Ignoring other aspects of the languages, let's call that true.
But Russ Cox is saying that you would also use yacc when you want an O() bound on the parsing algorithm (a linear-time algorithm). That is, the technique they advocate will produce parsers more than a constant factor slower than yacc.
Frankly I don't understand their rebuttal at all. Is the rebuttal that it doesn't matter for practical purposes? That doesn't sound like computer science to me.
And practically speaking, it does matter: plenty of people write exponential-time regexps without knowing it, and it blows up in their face on large inputs.
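To illustrate the exponential-regexp point, here is a deliberately naive backtracking matcher for the pattern `(a|aa)*` -- the classic shape behind "catastrophic backtracking". The function names are my own for illustration. Against a string of n `a`s followed by a `b` the pattern can never match, so the matcher ends up exploring every way of splitting the `a`s into singles and pairs, which is exponentially many:

```python
# Naive backtracking matcher for the regex (a|aa)* applied to all of s.
# On inputs like "aaa...ab" it must fail, so it tries every split of the
# run of a's -- the call count grows like the Fibonacci sequence, i.e.
# exponentially in the input length.

def try_match(s, i, calls):
    """Try to match (a|aa)* against s[i:]; calls[0] counts recursive calls."""
    calls[0] += 1
    if i == len(s):
        return True                      # consumed everything: match
    if s[i] == 'a' and try_match(s, i + 1, calls):
        return True                      # take one 'a'
    if s[i:i + 2] == 'aa' and try_match(s, i + 2, calls):
        return True                      # take two 'a's
    return False                         # the trailing 'b' always blocks us

def count_calls(n):
    """Return how many recursive calls it takes to reject 'a'*n + 'b'."""
    calls = [0]
    try_match('a' * n + 'b', 0, calls)
    return calls[0]
```

Doubling n from 10 to 20 multiplies the work by roughly two orders of magnitude, which is exactly the blow-up a user of a backtracking regex engine hits on large inputs.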
> "the only reason you would now use yacc is parsing speed"
Wikipedia says another advantage of LR parsing over LL is that the error messages are more localized, and therefore more relevant. Would derivatives in an LL parser (e.g. combinators) help with that disadvantage?
If not -- and because derivatives run quickly in practice on correct grammars but approach their theoretical exponential worst case on random inputs -- perhaps the best use for derivative-based grammars is giving realtime feedback while typing code into an IDE that enforces correct input and editing (e.g. always producing a corresponding correctly-placed close bracket somewhere in the code when the programmer types the open bracket), and only allows previously entered source code to be edited within the IDE. Such a use would give decent parsing speed with no need for error messages -- the very thing parsing with derivatives is good at (assuming this research is correct).
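For readers wondering what "derivatives" means here, a minimal sketch of Brzozowski's regular-expression derivatives in Python may help -- this is the regex-level seed of the parsing-with-derivatives idea (the paper extends it to context-free grammars). The class and function names are my own:

```python
# Brzozowski derivatives for regular expressions.
# deriv(r, c) returns a regex matching { s : c+s is matched by r };
# a string matches r iff, after taking the derivative w.r.t. each of
# its characters in turn, the remaining regex is nullable (accepts "").
from dataclasses import dataclass

class Rx: pass

@dataclass(frozen=True)
class Empty(Rx): pass          # matches nothing

@dataclass(frozen=True)
class Eps(Rx): pass            # matches only the empty string

@dataclass(frozen=True)
class Char(Rx):
    c: str

@dataclass(frozen=True)
class Alt(Rx):
    left: Rx
    right: Rx

@dataclass(frozen=True)
class Seq(Rx):
    first: Rx
    second: Rx

@dataclass(frozen=True)
class Star(Rx):
    body: Rx

def nullable(r):
    """Does r match the empty string?"""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Char)):
        return False
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    return nullable(r.first) and nullable(r.second)   # Seq

def deriv(r, c):
    """The derivative of r with respect to the character c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Char):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, c), deriv(r.right, c))
    if isinstance(r, Star):
        return Seq(deriv(r.body, c), r)
    # Seq: if the first part can be empty, c may also start the second part
    d = Seq(deriv(r.first, c), r.second)
    return Alt(d, deriv(r.second, c)) if nullable(r.first) else d

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)
```

For example, with `r = Seq(Star(Alt(Char('a'), Char('b'))), Char('c'))` -- i.e. `(a|b)*c` -- `matches(r, "abc")` holds but `matches(r, "ab")` does not. Note this sketch does no simplification of the growing terms, which is exactly where the practical performance concerns discussed above come from.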
But I want to learn, and if possible even help. What do I have to learn to understand this and maybe implement it in my own languages?

Everything you need to know to understand the algorithm would be covered in a mandatory "intro to theory of computation" or "intro to automata" course in a CS undergrad degree. You could start by reading the first half of a good textbook: we used the book by Hopcroft, Motwani and Ullman: http://infolab.stanford.edu/~ullman/ialc.html and I liked it a lot, but I've also heard good things about Sipser's book: http://www-math.mit.edu/~sipser/book.html