Yacc is dead (arxiv.org)
154 points by benblack on Nov 28, 2010 | hide | past | favorite | 44 comments



(Article author here.)

I'm delighted to see this get some attention here.

I absolutely love these techniques for parsing, but as my primary research area is static analysis, I haven't had time to revise this paper and resubmit.

As it stands, I may never get the time to do so. :(

I posted it on arxiv so that David could reference it for his Ph.D. school apps.

Since some of you have asked, here are the reviews:

http://matt.might.net/papers/reviews/esop2010-derivatives.tx...

I do have an updated implementation that's much cleaner and faster, and I've been planning to write it up in a blog post. (Alas, no opportunity yet.)

David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.

If you're interested in getting your name on a scientific publication and helping this get the attention of the scientific community, you can help us by creating an implementation of either technique in your favorite language and beating on it to help find the inefficiencies.

(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)

Please email me if that's something you're interested in doing.

Can HN do science? I'd love to find out.


"this is not quantum theory, after all..." is one of the funnier comments I've seen on a review. Judging by the rest of it, I'm tempted to guess that it was written by someone from team-PLT :)

Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) would be something fun to do.


Here is the git repo (over http) for my Haskell implementation, which exploits the technique to be linear for LL(k) grammars (the Zip module is where this is implemented). The constant overhead is still extremely high because we need to compute a fixed point over the whole parse graph for every input token. I'm working on getting that down...

http://david.darais.com/git/research/der-parser-3/
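
(For readers wondering why a fixed point shows up at all: with a recursive grammar, questions like "does this parser accept the empty string?" are themselves recursively defined, so they are answered by iterating to a least fixed point. The toy Haskell sketch below illustrates just that idea for nullability; the grammar type and names are invented for illustration and are not the der-parser code.)

  import qualified Data.Map as M
  import Data.Maybe (fromMaybe)

  -- A toy grammar representation; named nonterminals let recursive
  -- (cyclic) grammars be written down finitely.
  data G = Empty          -- the empty language
         | Eps            -- the empty string
         | Tok Char       -- a single token
         | Alt G G        -- union
         | Cat G G        -- concatenation
         | Ref String     -- reference to a named nonterminal

  type Grammar = M.Map String G

  -- Nullability ("does this language contain the empty string?") is a
  -- least fixed point: start with everything False and iterate until
  -- nothing changes.
  nullableMap :: Grammar -> M.Map String Bool
  nullableMap g = go (M.map (const False) g)
    where
      go env | env' == env = env
             | otherwise   = go env'
        where env' = M.map (nullable env) g

      nullable _   Empty     = False
      nullable _   Eps       = True
      nullable _   (Tok _)   = False
      nullable env (Alt a b) = nullable env a || nullable env b
      nullable env (Cat a b) = nullable env a && nullable env b
      nullable env (Ref n)   = fromMaybe False (M.lookup n env)

  -- Example: E ::= E 's' E | 'x'   (E is not nullable)
  -- ghci> nullableMap (M.fromList
  --         [("E", Alt (Cat (Ref "E") (Cat (Tok 's') (Ref "E"))) (Tok 'x'))])
  -- fromList [("E",False)]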


Thank you!


Here is my latest Haskell implementation that Matt is referring to (git repo):

http://david.darais.com/git/research/der-parser-3/


I haven't finished the paper yet, but this type of technique excites me as well. Recently, I did an assignment on the paper "memoization in top-down parsing" by Mark Johnson. Have you seen this paper, and do you think that there are any similarities between the derivatives of a CFG and continuations of parsing procedures (for variables and their related productions)?


I'll have to check out Mark's paper, but just by your description, I bet they're related.

For finite-state and pushdown automata, derivatives and continuations are two sides of the same coin.


How does your approach compare with Koen Claessen's Parallel Parsing Processes?

http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...

I haven't studied either approach well, but they seem similar in that both are like a breadth-first traversal of the parser combinators, instead of the traditional depth-first (backtracking) traversal. That is, when you have a union A u B, do you process it by advancing both sides in tandem, D_c A u D_c B, instead of the traditional approach of first computing all results of A on the entire input and then computing all results of B on the entire input?
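
(To make the "advance both sides in tandem" idea concrete, here is a minimal Haskell sketch of Brzozowski-style derivatives for plain regular expressions; with no recursion, none of the fixed-point or memoization machinery the paper needs for full CFGs appears. Note the union case of deriv: both alternatives are advanced in tandem, just as described above.)

  -- A minimal Brzozowski-derivative matcher for regular expressions,
  -- meant only to illustrate the flavour of the technique.
  data Re = Fail            -- the empty language
          | Eps             -- the empty string
          | Chr Char        -- a single character
          | Alt Re Re       -- union
          | Cat Re Re       -- concatenation
          | Star Re         -- Kleene star

  -- Does the language contain the empty string?
  nullable :: Re -> Bool
  nullable Fail      = False
  nullable Eps       = True
  nullable (Chr _)   = False
  nullable (Alt a b) = nullable a || nullable b
  nullable (Cat a b) = nullable a && nullable b
  nullable (Star _)  = True

  -- D_c: the language of suffixes after consuming the character c.
  -- The union case advances both alternatives in tandem.
  deriv :: Char -> Re -> Re
  deriv _ Fail      = Fail
  deriv _ Eps       = Fail
  deriv c (Chr c')  = if c == c' then Eps else Fail
  deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
  deriv c (Cat a b)
    | nullable a    = Alt (Cat (deriv c a) b) (deriv c b)
    | otherwise     = Cat (deriv c a) b
  deriv c (Star a)  = Cat (deriv c a) (Star a)

  -- Match by taking one derivative per input character.
  matches :: Re -> String -> Bool
  matches r = nullable . foldl (flip deriv) r

  -- ghci> matches (Cat (Chr 'x') (Star (Cat (Chr 's') (Chr 'x')))) "xsxsx"
  -- True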

Parallel Parsing Processes don't seem to handle left recursion. On the other hand, they parse S-expressions in linear time without a specially coded repetition operation ;)

When I find time I'm going to try to do an F# implementation.


Yes, breadth first traversal of the solution space is very similar in spirit to our approach.

I have since completely rewritten the Haskell implementation. You should really check it out, especially if you are thinking about writing one of your own.

(git repo) http://david.darais.com/git/research/der-parser-3

This implementation no longer requires a specially coded repetition operator to get good complexity for regular and LL(k) grammars. This is a result of our progress on the performance of the theory since the paper's submission. My technique for achieving this uses structure derivatives (context-based derivatives, or pseudo-equivalently, continuations) to avoid taking unnecessary parser derivatives. In a sense, it "focuses" the parser computation on a small sub-parser.

Ping me if you have any questions; this paper has been getting passed around quite a bit lately, and Matt and I have made progress in certain areas since the paper was written -- like the Rep hack.


emailed!


Here's what a grammar actually looks like in their Scala version. This grammar is for arithmetic over the language where x represents 1 and s represents +. This only generates the parse tree, not the final answer. Comments are added by me:

  // The Nodes in the parse tree
  abstract class Exp
  case object One extends Exp
  case class Sum(e1 : Exp, e2 : Exp) extends Exp
  
  // Terminals
  lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
  lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
  
  // Definition of an expression
  // I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
  lazy val EXP : Parser[Char,Exp] = 
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ X) ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
    rule(X ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP) ==> { case e => e } ||
    rule(Epsilon[Char]) ==> { case () => One } 
  
  // Actually run the rule
  val xin = Stream.fromIterator("xsxsxsxsx".elements)
  EXP.parseFull(xin)
  // return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)


What is actually state-of-the-art in parsing?

When I, as an amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.

[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)


I am not so sure about the actual state of the art (I have read some PEG papers and some OMeta stuff from VPRI, plus used ANTLR, Bison, and Coco/R), but a very interesting comment on HN (http://news.ycombinator.com/item?id=1643715) points out that most production compilers use hand-written recursive-descent parsers, primarily for practical reasons.

I took two compiler construction courses during my college years, one using an LL(1) grammar with a recursive-descent parser, the other using an LR(1) grammar with the flex+bison tool-chain. The experience from the LL case helped me tremendously when writing the LR-language compiler. I think the recursive-descent experience also helps with understanding attribute grammars. Probably the best introduction to recursive-descent parsers is Niklaus Wirth's compiler construction book (http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf).


What are the practical reasons? Dealing with non-uniform edge cases and other strange constraints? Integrating the stuff the compiler needs to do with the parser-generator?


I can imagine one practical reason is that a lot of information (e.g. type information, variable bindings) flows top-down, but bottom-up parsers reduce rules in the opposite direction, so it takes extra effort to fit your mental model of the program to the reduction strategy of a bottom-up parser.
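
(As a toy illustration of that top-down flow, with names invented here and not taken from any compiler mentioned in this thread: a hand-written recursive-descent parser can simply pass whatever it has gathered so far as an ordinary argument, e.g. the set of variables in scope, and report unbound variables while it parses.)

  import qualified Data.Set as S

  -- A toy language:  e ::= let <id> = e in e | <id> | <int>
  data Tok = TLet | TEq | TIn | TId String | TInt Int  deriving (Eq, Show)
  data Exp = Let String Exp Exp | Var String | Lit Int  deriving Show

  -- A hand-written recursive-descent parser.  The set of variables in
  -- scope is threaded top-down as an ordinary argument, so "is this
  -- identifier bound?" is answered the moment the identifier is parsed.
  parseExp :: S.Set String -> [Tok] -> Either String (Exp, [Tok])
  parseExp env (TLet : TId x : TEq : ts) = do
    (rhs, ts1) <- parseExp env ts
    case ts1 of
      TIn : ts2 -> do
        (body, ts3) <- parseExp (S.insert x env) ts2   -- x is in scope below
        return (Let x rhs body, ts3)
      _ -> Left "expected 'in'"
  parseExp env (TId x : ts)
    | x `S.member` env = return (Var x, ts)
    | otherwise        = Left ("unbound variable: " ++ x)
  parseExp _ (TInt n : ts) = return (Lit n, ts)
  parseExp _ _ = Left "parse error"

  -- ghci> parseExp S.empty [TLet, TId "a", TEq, TInt 1, TIn, TId "a"]
  -- Right (Let "a" (Lit 1) (Var "a"),[])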


Vladimir Safonov's Trustworthy Compilers is another good reference on this subject, in addition to the books already mentioned here on HN. Safonov works through some examples of real-world languages, and argues why hand-written parsers are superior to Yacc-style parser generators. He gives examples of ALGOL-68 and FORTRAN.

I don't know of any definitive text that covers a wide range of practical examples that I've been totally satisfied with, though. Even Safonov waves his hands a little, saying the parser is the one part of the compiler to be written by the most experienced and best compiler engineer on the team, because it influences the performance of everything after it.


The comment I referenced mentions that the context available in recursive-descent parsers can be used for error recovery, e.g., phrase-level recovery. The excellent "Programming Language Pragmatics" by Michael Scott has a good chapter on this, as does "Modern Compiler Design" by Grune, Bal, Jacobs and Langendoen.


There was a really cool parsing paper at POPL this year [1] that goes 'beyond CFGs'... but as far as I know their system YAKKER is still in development / unreleased.

I think from a user's perspective, e.g. someone wanting to design a DSL, it would be nice to be able to write down arbitrary CFGs and not necessarily have to know lots of technical parsing details ("what are all these shift-reduce conflicts!?"). Maybe PEGs aren't the best choice for this (no left recursion); MetaBorg's SGLR is probably the state of the art then, as Zef said.

Though one of the nice things about PEGs is that there's no ambiguity; MetaBorg will do type-based disambiguation, but only after parsing finishes. I'm working on an undergraduate thesis on 'extensible syntax' right now, trying to jump off from some of their work (but using a variation on Earley parsing), in particular to see if we could use types to help disambiguate incrementally during parsing.

[1] http://portal.acm.org/citation.cfm?id=1706347


The PDF from AT&T Labs for those of us without ACM digital library access:

http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...


PDF of paper that isn't behind a paywall: http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...


The state of the art in parsing is SGLR (http://strategoxt.org/Sdf/SGLR) and GLL parsing.


See also Adam Megacz' SBP ("Scannerless Boolean Parser") (http://research.cs.berkeley.edu/project/sbp/). Boolean grammars are a superset of context-free grammars that can do some interesting things. For example, one can write a scannerless grammar for Python that handles the significant indentation in the grammar.


Thanks a lot for the link, I'll look through it.

I also found the GLL paper at p. 113 [123] of http://ldta.info/2009/ldta2009proceedings.pdf


The SGLR parser in Stratego/XT is not state of the art, at least not for production use. It only supports pure seven-bit ASCII; it's a throwback to the 1970s.


Well, ANTLR (http://www.antlr.org/) has already replaced yacc for most practical work. Daniel Spiewak also discusses parser combinators and how GLL parsers can be much easier to use and yet fast enough most of the time (http://www.codecommit.com/blog/scala/unveiling-the-mysteries...). He describes parser combinators as domain-specific languages for creating parsers (http://www.codecommit.com/blog/scala/the-magic-behind-parser...).

Despite all of these, adoption of better techniques is really slow. As usual.
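
(For anyone who hasn't seen the combinator style mentioned above: a parser combinator library is just ordinary functions and values in the host language, so a grammar reads like a small DSL. Below is a minimal hand-rolled Haskell sketch in the list-of-successes style; it is not Parsec's or the Scala library's actual API, only an illustration, encoding the x/s grammar from the Scala example elsewhere in this thread.)

  -- A minimal backtracking parser-combinator sketch (list of successes).
  newtype P a = P { runP :: String -> [(a, String)] }

  item :: Char -> P Char
  item c = P $ \s -> case s of
    (x:xs) | x == c -> [(c, xs)]
    _               -> []

  instance Functor P where
    fmap f p = P $ \s -> [ (f a, rest) | (a, rest) <- runP p s ]

  instance Applicative P where
    pure a    = P $ \s -> [(a, s)]
    pf <*> pa = P $ \s -> [ (f a, s2) | (f, s1) <- runP pf s, (a, s2) <- runP pa s1 ]

  -- Union: keep the results of both alternatives.
  (<||>) :: P a -> P a -> P a
  p <||> q = P $ \s -> runP p s ++ runP q s

  -- The grammar reads like a DSL:  EXP ::= 'x' | 'x' 's' EXP
  data E = One | Sum E E  deriving Show

  expr :: P E
  expr = (One <$ item 'x')
    <||> ((Sum One <$ item 'x' <* item 's') <*> expr)

  -- ghci> [ e | (e, "") <- runP expr "xsxsx" ]
  -- [Sum One (Sum One One)]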


Heh... Yacc, Bison, Antlr...


Fascinating. I don't understand it all yet - I'm reading it carefully but it'll be weeks before I really get it.

However ...

I'm getting the feeling that this properly captures an intuition I've had about parsing for some time: that there should be a way of parsing the entire text at once, with the parse settling onto the text all at the same time. I don't know if that's what it's saying, but that's the sense I get from it.

But even if it isn't, it looks intriguing.


You mean something like this?: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...

Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.

Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.

This sort of thing is one of the admittedly rare cases where computer science is making surprising amounts of progress in relatively practical fields. Parsing has gotten noticeably easier in the past ten years, if you know where to look for the right libraries, and it has been affecting my programming quite a bit. Often a parser is the "correct" solution, but we used to reach for hacked-up crap with regular expressions or worse, because it was ten times easier and did 80% of the job (and ignore the 10% that is a serious security vulnerability, since everybody always does). Now it's maybe twice as hard to do it right, or, given how easy it is to underestimate the difficulty of getting the hacked-up crap to actually work everywhere in the real world, sometimes it's flat-out easier once you make a full accounting of the costs.


Here's a quick summary of the broader implications of that link, because I had trouble wrapping my head around it at first.

Suppose you have the ability to take a chunk of text and construct a partial parse state from it. In the case of regexp matching, these partial parse states are functions mapping one state of the regexp-matching automaton to another. You need one more thing: the ability to append two of these partial parse states, combining them into one that is itself a partial parse state. In the regexp example, this is just function composition. The key is that this operation must be associative, and there must be an identity element: some partial parse state such that combining it with another state doesn't change anything. This combination of a set of parse states, an associative binary operation, and an identity element is called a monoid.

Once you have these conditions fulfilled, you can do all sorts of fun stuff. For instance, you can represent a string as a tree of chunks, and cache partial parse states at the nodes in the tree. That way, when you change the string, you can recompute the changed parse states in something like O(lg n) time, rather than going through and re-parsing the entire string. Or you can almost trivially parallelize your parser.
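
(Here is a minimal Haskell sketch of that idea for a tiny DFA. The automaton and names are invented here for illustration; the linked post and finger-tree approaches do essentially this with more machinery. Each chunk of input is summarized as a state-transition function, and the monoid operation runs one chunk's function after the other's.)

  -- States of a tiny DFA: accept strings over {'a','b'} with an even
  -- number of 'b's.
  data S = EvenB | OddB  deriving (Eq, Show)

  step :: Char -> S -> S
  step 'b' EvenB = OddB
  step 'b' OddB  = EvenB
  step _   s     = s

  -- The "partial parse state" for a chunk of input is just a function
  -- from automaton state to automaton state.
  newtype Trans = Trans { runTrans :: S -> S }

  -- The monoid operation runs the left chunk, then the right chunk.
  -- It is associative, and the identity chunk changes nothing.
  instance Semigroup Trans where
    Trans f <> Trans g = Trans (g . f)
  instance Monoid Trans where
    mempty = Trans id

  -- Summarizing is a monoid homomorphism:
  --   summarize (xs ++ ys) == summarize xs <> summarize ys
  -- so chunks can be combined in any grouping, cached at the nodes of a
  -- balanced tree (giving O(log n) incremental updates), or computed in
  -- parallel.
  summarize :: String -> Trans
  summarize = foldMap (Trans . step)

  accepts :: String -> Bool
  accepts s = runTrans (summarize s) EvenB == EvenB

  -- ghci> accepts "abba"
  -- True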

A week ago, I did exactly this: I had a language that needed parsing, and I wanted to incrementally reparse when I changed the (potentially very long) string, so I used a finger tree and wrote an incremental parser. It works beautifully.


Is it just me, or would this fit quite nicely into an IDE (i.e. a language-specific editor)?


I think it would, but regexps are not enough to parse most interesting languages. Perhaps you could extend it to general parsers, but I suspect it may be impossible to do efficiently for general context-sensitive parsers, because unlike a regular expression, the parser can be in infinitely many different states when it arrives at the substring. Perhaps laziness can do some tricks, though. Anyone have some ideas?


Cool reference, and I'll be working on that too over the next week or so to see if that's close to what I mean.

Thanks.


Regular expressions are always the wrong tool for the job of parsing languages with recursive syntax. See, for instance, http://stackoverflow.com/questions/1732348/regex-match-open-...


Did you actually read the linked material, or did you just trigger on the keywords "parse" and "regular expression"? I did say it wasn't exactly what was being looked for, but it would be interesting to see if someone could extend the results in the linked post from regular expressions to parse trees. It isn't immediately obvious to me whether you could.


Yes, it's possible to extend the method in the linked post to parse trees. Check out monoidal parsing:

http://comonad.com/reader/2009/iteratees-parsec-and-monoid/

This guy figured out how to turn parsers written with Parsec (an excellent collection of parser combinators; truly a pleasure to use) into monoidal parsers that you can use for incremental and/or parallel parsing.



I have since rewritten the Haskell implementation to compute fixed points on cyclic graphs without using pointers or Monads. Check it out if it interests you (git repo):

http://david.darais.com/git/research/der-parser-3/


There are so many useful links in this discussion that I consolidated them (and a few extras) here: http://post.b3k.us/modern-parsing-bibliography


Why was it rejected by ESOP?


Well, if you want to know what ESOP rejection reviews look like: http://phlegmaticprogrammer.wordpress.com/2010/11/21/reviews...

and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...


Possibly because they hand-waved away a lot of the formal specification of their solution. For example, there was no attempt to analyze the complexity of the generated rule trees.


Brzozowski's derivatives of regular expressions have been in use in the PLT compiler tools for ages now.


Better comments at programming.reddit, sadly: http://www.reddit.com/r/programming/comments/ed2pb/yacc_is_d...


Do I hear John McCarthy chuckle?



