Parsing: the solved problem that isn't (tratt.net)
200 points by ltratt on March 15, 2011 | 47 comments



About packrat parsing...

At my office I implemented Warth, Douglass and Millstein's left-recursion algorithm, and ran into two things. First, the algorithm described in the paper is kind of a mess, as the authors admit. Second, it doesn't actually work as advertised, as the OP discovered. What should have been a simple implementation taking a few days turned into many weeks of trial and error, because with each change I naively thought I was almost there. Now I know better than to go around trusting strangers with algorithmic candy.

The idea is to introduce a notion of which left-recursions are allowed and which aren't. Say we have a rule R which matched input M. Call M the "match". Say M has a sub-match M' which also depends on rule R. Then M is allowed if and only if the sub-match M' is in a certain sense smaller than M: either it is smaller in size, or M' and M depend on the same choice rule but M''s choice is ordered after M's. In the implementation we take the "growing the seed" approach described in Warth et al., but with the ability to track multiple left-recursion points, and the left-recursion seed parses that every match depends on. For a pathological grammar this may end up exponentially slow.
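For flavour, the easy half of the idea -- growing a seed for direct left recursion -- looks roughly like this (a minimal Python sketch, not the implementation described above, and it deliberately ignores the mutual-recursion bookkeeping):

    # Rules are functions taking (text, pos) and returning the end position
    # of a match, or None on failure.
    def memo_left_rec(rule):
        """Memoise a (possibly directly left-recursive) rule by seed growing."""
        cache = {}
        def wrapped(text, pos):
            if pos in cache:
                return cache[pos]
            cache[pos] = None          # plant a failure seed
            seed = None
            while True:
                result = rule(text, pos)
                if result is None or (seed is not None and result <= seed):
                    break              # the parse stopped growing
                seed = result
                cache[pos] = seed      # grow the seed and retry
            cache[pos] = seed
            return seed
        return wrapped

    # A ::= A 'a' / 'a'
    def rule_a(text, pos):
        end = A(text, pos)             # left-recursive call hits the cache
        if end is not None and text.startswith('a', end):
            return end + 1
        if text.startswith('a', pos):
            return pos + 1
        return None

    A = memo_left_rec(rule_a)
    print(A("aaa", 0))                 # -> 3

The grammars below, with mutual left recursion between A and B, are exactly where a naive version of this falls over.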

I'm willing to bet this curtailment process is essentially the same as the process described in Frost, Hafiz and Callaghan's paper on left-recursive parser combinators[1], but I haven't looked too closely at it. I discovered Frost et al's paper only after I was done.

I may have done a bad job of explaining this, but our unit tests have some pretty arbitrary grammars like

S ::= <A>/<B>

A ::= <A> ‘a’ / <B> / ‘a’

B ::= <B> ‘b’ / <A> / ‘b’

and our parser handles these admirably and with excellent greed.

I've been wondering what to do with this information; I drafted an e-mail to Warth/Douglass/Millstein but haven't sent it.

[1] http://www.google.com/url?sa=t&source=web&cd=1&v...


It actually sounds more like cancellation parsing than Frost's approach

(which iirc was more that left recursion is bounded by input length)


I have a couple of loosely connected points and links:

You want to buy 'Parsing Techniques 2nd Ed' -- this book is fantastic and will give you the understanding and clarity you seek. It is wonderfully comprehensive.

And Schmitz's thesis is an excellent read on tools and techniques to deal with ambiguity http://www.lsv.ens-cachan.fr/~schmitz/papers

and the non-canonical parser that results is quite interesting too :-)

PEGs vs Boolean Grammars: I think adding negation and conjunction to a grammar is the same as positive and negative lookahead

PEGs + Ambiguity: Like other people, I solved this in my last top-down parser by attaching both precedences and relations to terms

i.e. E[50] ::= E[< 50] + E[<= 50]

so you can force a left or right associative operator.
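As a rough Python sketch of what those bracketed relations amount to (mine, not the parent's parser): strict '<' on a side forbids same-level nesting there, which is what fixes the associativity.

    # One token per primary; the bound plays the role of E[< max_prec].
    OPS = {'+': (50, 'left'), '^': (60, 'right')}

    def parse_expr(tokens, pos, max_prec=1000):
        left, pos = tokens[pos], pos + 1
        while pos < len(tokens) and tokens[pos] in OPS:
            prec, assoc = OPS[tokens[pos]]
            if prec >= max_prec:
                break
            op, pos = tokens[pos], pos + 1
            # right-assoc: the right side may sit at the same level (E[<= p]);
            # left-assoc: it must be strictly tighter (E[< p]).
            bound = prec + 1 if assoc == 'right' else prec
            right, pos = parse_expr(tokens, pos, bound)
            left = (op, left, right)
        return left, pos

    print(parse_expr(list('a+b+c'), 0)[0])   # ('+', ('+', 'a', 'b'), 'c')
    print(parse_expr(list('a^b^c'), 0)[0])   # ('^', 'a', ('^', 'b', 'c'))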

That said, although determinism is nice, I think ambiguous parses are inevitable once you have a sufficient level of error handling within the parser.

parsing theory is pretty developed and established, but in reality many parsers are cobbled-together top-down parsers.

you can't use existing parse libraries to handle HTTP for example.

(That said, yakker is an approach with earley parsing to handle data dependent grammars)


and while i'm here - the original Earley paper is full of bugs. For a modern treatment you may find Aycock & Horspool's work on 'Practical Earley Parsing' interesting, as well as the work on the Marpa parser in Perl.

to me: the earley parser and the packrat parser are similar in nature - they are both chart parsers. one is depth first, one is breadth first.

my question: can you engineer an Earley parser that supports ordered choice and exploits this to avoid broadening the search too early -- i.e. can Earley parsers support PEGs?

I think it is possible and when i'm not drowning in my startup work i'd like to take a longer look at it.

fwiw my terrible attempts at an earley recognizer are here http://pastie.org/private/uezi1mxiur8dymdbh2scw

i've tried to use some ideas from the Aycock and the Aretz variants of the Earley parser - avoiding lists of Earley items at a character - instead storing the reductions/completions separately from the scanning items, and using a driver rather than the predictor/completer loop (still need to add nullable support...)
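For readers who haven't seen it, the textbook predictor/scanner/completer loop being tinkered with above looks roughly like this (a toy Python recogniser, no nullable-rule fix, nothing to do with the pastie code):

    def earley_recognise(grammar, start, tokens):
        # grammar maps a nonterminal to a list of right-hand-side tuples.
        # An item is (lhs, rhs, dot, origin).
        chart = [set() for _ in range(len(tokens) + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))
        for i in range(len(tokens) + 1):
            work = list(chart[i])
            while work:
                lhs, rhs, dot, origin = work.pop()
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:                            # predictor
                        for alt in grammar[sym]:
                            item = (sym, alt, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); work.append(item)
                    elif i < len(tokens) and tokens[i] == sym:    # scanner
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                                             # completer
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); work.append(item)
        return any(l == start and d == len(r) and o == 0
                   for (l, r, d, o) in chart[len(tokens)])

    # E ::= E '+' E / 'a'   (ambiguous and left recursive: fine for Earley)
    grammar = {'E': [('E', '+', 'E'), ('a',)]}
    print(earley_recognise(grammar, 'E', list('a+a+a')))   # True
    print(earley_recognise(grammar, 'E', list('a+')))      # False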


I'm not sure the Earley paper "is full of bugs" - to the best of my knowledge, it has only one bug (albeit a biggy), which is in its description of how to create a parse tree.


it isn't cubic time either iirc


Yes, that's part of the same bug, as I understand it anyway (to be clear: I'm not sure that anyone's claiming that there are problems with Earley's recogniser, which is really the core of his paper). Scott's "SPPF-Style Parsing From Earley Recognisers" gives an overview of the bug and its implications.


I can't remember off hand if it dealt with nullable terms or hidden left recursion properly either.

don't get me wrong: I like the earley parser :-) I just think the original paper has some omissions and the treatment of earley parsing in 'parsing techniques 2nd ed' is thorough and includes substantial references and explanations of further work.


isn't the mongrel/thin/unicorn http parser library built using ragel (i.e. an existing parse library)? Could you expand on why you think it's not possible to handle http with existing tools?


ragel is a tool for writing state machines and automata.

many parsers are written as automata, but that doesn't mean it is in the same category as parsing tools such as LR, LL, GLR, PEG or CFG based approaches.


yes, but the generated state machine recognizes a regular language, so if the automaton can recognize HTTP that would mean HTTP is a regular language, and existing parsing tools should be able to deal with it - or am I missing something?


regular expressions (in the CS sense) are equivalent to finite state machines. regular expressions can't count or match ()'s. ragel allows you to mix in code within the state machine, so it is actually far, far more powerful than a finite state machine.

in http, handling things like 'Content-Length: %d' and then reading a body of that length is a little harder, as is handling chunked transfer encoding. http is quite fiendish in places :-)

these are 'data dependent' - the parse stream contains information (i.e. a length) about the subsequent tokens. although some regexps have back references, these aren't commonplace in parsing tools/formalisms like LR, LL, LALR, CFG or PEGs.
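A toy Python sketch of that data dependence (nowhere near a real HTTP parser): the value parsed out of one field decides how much input the next step consumes, which is exactly what plain CFG/PEG rules can't express.

    import io

    def read_message(stream):
        headers = {}
        while True:
            line = stream.readline().rstrip(b'\r\n')
            if not line:                      # blank line ends the headers
                break
            name, _, value = line.partition(b':')
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get(b'content-length', b'0'))
        body = stream.read(length)            # byte count driven by parsed data
        return headers, body

    msg = io.BytesIO(b"Content-Length: 5\r\n\r\nhelloEXTRA")
    print(read_message(msg))   # ({b'content-length': b'5'}, b'hello')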

my point is simply that a lot of the parsing drama of late has revolved around the simple task of parsing a language, rather than parsing network protocols.

there is a larger class of parsing problems that are still to be tackled.


Ah, now I get what you meant, thanks for the explanation


Speaking of composable grammars, if you were not already aware, JetBrains (maker of the wholly incredible, and now Open Source, IntelliJ) has a language platform which is targeting just that kind of thing.

They call it Meta Programming System (http://www.jetbrains.com/mps/) and it is also Open Source.

Be forewarned that MPS is almost a closed system. One must develop the grammar only in MPS (no textual grammar to speak of), and the editor for the language described by that grammar is then defined using this bizarre form-layout language (also edited in MPS). One cannot (that I am aware of) take the newly created MPS system for that language and point it at your favorite source code text file. It does have a seemingly powerful code generation system, so it might be possible to generate the textual grammar if you were so inclined. However, the text generator also uses a bizarre language so that is likely more pain than one would wish to endure.

The advantage of a system such as that is that there is only The One True Brace Style(tm) because the "braces" and other punctuation are merely icons to please the eye; they are not "editable" like in Emacs.


Indeed, the allusion to syntax directed editing in the article is due to MPS.


I'd also say that the hardest part isn't the parsing itself per se, it's error reporting and recovery. Showing parse errors as someone types in an IDE, for example, requires an order of magnitude more sophistication than simply handling valid productions does.


This is where recursive descent parsing really shines. When an invalid sequence is detected deep in the recursion, you don't have enough context to generate the best error message, but as you unwind you can add additional context, such that by the time you're at the top of the call chain you can generate very high quality error messages.

Of course, you have to be careful about dealing with recursion that's too deep. Large input files and/or grammars that require a lot of backtracking can make it a very slow parser too.

But for quite a lot of things, it's very appropriate. It's pretty much my go to mechanism when I need a custom parser.
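A minimal Python sketch of the unwinding trick (hypothetical, not the parent's parser): each level catches the error, tacks on what it was doing, and re-raises.

    class ParseError(Exception):
        def __init__(self, msg):
            super().__init__(msg)
            self.context = []                 # filled in as we unwind

    def parse_number(tokens, pos):
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ParseError("expected a number at token %d" % pos)
        return int(tokens[pos]), pos + 1

    def parse_pair(tokens, pos):
        try:
            a, pos = parse_number(tokens, pos)
            b, pos = parse_number(tokens, pos)
            return (a, b), pos
        except ParseError as e:
            e.context.append("while parsing a pair")     # context on the way up
            raise

    def parse_file(tokens):
        try:
            return parse_pair(tokens, 0)
        except ParseError as e:
            e.context.append("while parsing the input file")
            print(str(e) + "\n  " + "\n  ".join(e.context))

    parse_file(["1", "x"])
    # expected a number at token 1
    #   while parsing a pair
    #   while parsing the input file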


Derivative parsing http://mjtsai.com/blog/2010/12/13/derivative-parsing/ looks interesting.

It has several nice properties, including being composable and incremental. And it supports modifying the grammar on the fly.
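To make the idea concrete, here is the regular-language version of derivatives in Python (no simplification, and far short of the paper's full CFG treatment): the derivative of a language with respect to a character c is the set of suffixes that complete a string beginning with c, and matching is just repeated differentiation.

    EMPTY, EPS = ('empty',), ('eps',)
    def char(c):    return ('char', c)
    def alt(a, b):  return ('alt', a, b)
    def seq(a, b):  return ('seq', a, b)
    def star(a):    return ('star', a)

    def nullable(r):
        tag = r[0]
        if tag in ('eps', 'star'):   return True
        if tag in ('empty', 'char'): return False
        if tag == 'alt': return nullable(r[1]) or nullable(r[2])
        if tag == 'seq': return nullable(r[1]) and nullable(r[2])

    def deriv(r, c):
        tag = r[0]
        if tag in ('empty', 'eps'): return EMPTY
        if tag == 'char':           return EPS if r[1] == c else EMPTY
        if tag == 'alt':            return alt(deriv(r[1], c), deriv(r[2], c))
        if tag == 'star':           return seq(deriv(r[1], c), r)
        if tag == 'seq':
            d = seq(deriv(r[1], c), r[2])
            return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    r = seq(star(seq(char('a'), char('b'))), char('a'))    # (ab)*a
    print(matches(r, "aba"), matches(r, "abab"))           # True False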


It has several bad properties too. The algorithm they outline is exponential.

The challenge to make it run in cubic time will certainly turn it into a GLR or an Earley parser. (They can be formed from each other)

To me: it is just another way of looking at the dotted LR items used within an Earley parser. Earley parsers keep a state made up of grammar rules with a dot indicating where in the rule the parser is at.

(I wrote an earley recognizer that uses a derivative style interface to grammar rules.)

It has novelty, and some utility but it is not the holy grail you are looking for.


> The algorithm they outline is exponential.

They claim that it's actually roughly linear (given some fairly obvious optimizations) and provide a table of measurements that support said claim.

http://arxiv.org/PS_cache/arxiv/pdf/1010/1010.5023v1.pdf

  Number of tokens    Parsing time
  4,944               10 ms
  42,346              43 ms
  390,553             326 ms
  5,049,213           3.9 s
  22,155,402          17.1 s


I agree completely with the premise, which is that Parsing looks like a solved problem but really isn't.

One thing that should be a red flag is that lots of people still write parsers manually instead of using tools. GCC migrated from a yacc/bison-based parser to a hand-written recursive descent parser. I used to work in the same office as Bob Jervis who wrote Turbo C back in the day, and he said he never uses parser generators. Clearly something is wrong if the tools that have all the weight of the theory and formalisms are losing to coding parsers by hand.

I disagree that because computers are so much faster we should use less-efficient parsing algorithms that can handle all grammars (like Earley, GLR, etc). One reason I disagree is the ambiguity problem given in the article. The very process of building parse tables for a more restricted language subset like LL, LR, etc. can tell you that a grammar is not ambiguous, which is hugely beneficial. If you do not restrict yourself to such a subset, you never know whether your grammar has ambiguities or not. To me, this would be like programming in a language that doesn't give you syntax errors when you make a mistake. Sure, the syntax errors are annoying, but they are your only way of knowing you need to fix something!

(By the way, PEGs have the same issue -- by using prioritized choice, you never know when you're covering up an ambiguity. An example of this is C's famous "if/else" ambiguity. It's important to know that this ambiguity exists, because you have to tell your users about it.)
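As a toy sketch of that if/else case (hypothetical grammar, not C's): a PEG rule like Stmt ::= 'if' Cond Stmt 'else' Stmt / 'if' Cond Stmt / ... silently binds every else to the nearest if, because the else-bearing alternative is tried first, whereas a CFG tool would flag the grammar as ambiguous. In Python, with the greedy check for 'else' playing the role of the higher-priority alternative:

    def parse_stmt(toks, pos):
        if toks[pos:pos + 1] == ['if']:
            cond, pos2 = toks[pos + 1], pos + 2
            then, pos2 = parse_stmt(toks, pos2)
            if toks[pos2:pos2 + 1] == ['else']:        # first alternative
                other, pos3 = parse_stmt(toks, pos2 + 1)
                return ('if-else', cond, then, other), pos3
            return ('if', cond, then), pos2            # second alternative
        return toks[pos], pos + 1                      # bare statement

    tree, _ = parse_stmt(['if', 'c1', 'if', 'c2', 'x', 'else', 'x'], 0)
    print(tree)   # ('if', 'c1', ('if-else', 'c2', 'x', 'x'))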

Instead of using our extra CPU power to parse with less efficient algorithms, I think we should use our extra CPU power to integrate parsing more deeply into our editors/IDEs. With a fast parsing algorithm, you could parse as a programmer types, unlike the current syntax highlighting editors which use only rough pattern matching instead of having a real parse tree. Imagine seeing syntax errors the moment you type them. Imagine having your cursor inside a struct and having a little "inspector" pane off to the side that says "sizeof(struct MyStruct) = 96". Imagine getting real, context-sensitive completion even for complex languages like C++. That's what we should be spending CPU cycles on IMO.

The biggest thing that could be improved about parsing IMO is for grammars to be reusable. There is no reason that people should have to roll their own parser for a common language. There's no way you're going to get all the edge cases right unless you spend a LOT of time on it, but why should you have to do that? IMO there should be a "grammar library" the same way that platforms like the JVM or .NET have a "class library." You should be able to take a C++ parser off-the-shelf and use it in your program, thereby saving yourself literally man-years of work.

I've been working for a long time on a project that seeks to achieve this goal. It's tabled right now because of my work on a protocol buffer implementation (the two will be related in a deep way; protocol buffers are an ideal way of specifying parse trees, which are the output of a parser), but I am extremely motivated to return to it.

http://www.reverberate.org/gazelle/ https://github.com/haberman/upb/wiki


Delphi's IDE implements its code completion etc. using the real compiler parser, and has done for over a decade. The chief "tricks" are skipping method bodies other than the one the cursor is in, and otherwise having a fast parser (hand-written recursive descent / operator precedence hybrid in Delphi's case).

But completion / analysis etc. is still triggered on a pause after typing - it's not fast enough to run on every keystroke, not least because of basic reasons like having to page in referenced modules.

I agree that parsers or grammars should be reusable. I'd put parsers ahead of grammars, because frequently in practice there are some ambiguities around the edges in production grammars that are resolved using type and semantic analysis. What you want is something which can turn the text into a parse tree with the thorniest ambiguities already resolved.

Of course, doing interesting analysis with that tree is still pretty awkward if you have things like overloads in your language, to take a single example which needs pretty complete type analysis to resolve correctly in all cases.


> But completion / analysis etc. is still triggered on a pause after typing - it's not fast enough to run on every keystroke, not least because of basic reasons like having to page in referenced modules.

Well, not to mention that you wouldn't want to be corrected after each keystroke anyway. "Yes, I know 'whil' is not correct, that's why I'm about to type an 'e'."

I actually wish there was more of a delay in Qt Creator's syntax highlighting - of course the stuff I'm working on right now isn't correct yet.


> With a fast parsing algorithm, you could parse as a programmer types, unlike the current syntax highlighting editors which use only rough pattern matching instead of having a real parse tree. Imagine seeing syntax errors the moment you type them.

That's what Xcode 4 does with the new Clang/LLVM machinery inside.


> Instead of using our extra CPU power to parse with less efficient algorithms, I think we should use our extra CPU power to integrate parsing more deeply into our editors/IDEs. With a fast parsing algorithm, you could parse as a programmer types, unlike the current syntax highlighting editors which use only rough pattern matching instead of having a real parse tree.

Don't Eclipse/Visual Studio/Emacs Semantic/etc. already do this? The struct trick you mention would be new to me, but unless I'm missing something, seeing syntax errors the moment you type them and getting context-sensitive completion are features that have been available since before I started programming (in 2002).


ambiguity is useful for error recovery/error detection. also, some languages have ambiguity in their syntax (ML). I don't buy the 'optimization' argument. there is no reason we cannot have our cake and eat it - ambiguity and incremental parsing.

as for incremental parsing in ides, you may enjoy this thesis: http://jeff.over.bz/papers/2006/ms-thesis.pdf

finally:

I think parser-generators are unpopular because people would prefer to just write code, rather than compile something else to automatically generated code that is nigh on unreadable.

I think the popularity of regexes is due in part to the ease with which they can be embedded or used within the host language - either with syntactic sugar, or simply as a library.

combinators (parsec especially) hit a sweet spot of being able to write code that handles parsing succinctly, without having to complicate your build or auto-generate source.

i'd really prefer a library I can load and manipulate the grammar from, over yet another syntax and compiler in my tool chain.

(ps. (i'm saddened by the lack of left recursion support in gazelle))


> I think parser-generators are unpopular because people would prefer to just write code, rather than compile something else to automatically generated code that is nigh on unreadable.

I agree that generating source code is annoying, which is why Gazelle does not do it. It takes a VM approach instead; the parser is either interpreted or JIT-ted.

> I think the popularity of regexes is due in part to the ease of which they can be embedded or used within the host language - either with syntactic sugar, or simply as a library.

That is exactly what Gazelle is trying to do.

> i'd really prefer a library I can load and manipulate the grammar from, over yet another syntax and compiler in my tool chain.

Gazelle always loads its grammars at runtime. There's a compiler also, but it just generates byte-code that the runtime loads. But you can run the compiler at run-time too if you want.

If you'd rather build a grammar programmatically than use a syntax meant for it, more power to you (Gazelle will support it). But that doesn't seem to match your regex case: people specify regexes with a special syntax, not by building up an expression tree manually. The latter seems like a lot of work to me, and such grammars will not be reusable from other languages, but that might not be important to you.

> (ps. (i'm saddened by the lack of left recursion support in gazelle))

What is a case where you would really miss it, that isn't addressed by a repetition operator (*) or an operator-precedence parser?


I would like to say: awesome!

And yes most of my left recursion fetish would be covered by an operator precedence parser/left corner parser


> I think parser-generators are unpopular because people would prefer to just write code, rather than compile something else to automatically generated code that is nigh on unreadable.

Also, a manual lexer/parser can introduce context when necessary. E.g. Python has significant whitespace. The lexer (with contextual knowledge of indentation width) can easily emit indent/dedent tokens, so the grammar stays context free. With a tool like ANTLR you have to do weird stuff to parse Python.
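A rough Python sketch of that indent/dedent trick (simplified: spaces only, one statement per line, no error handling):

    def tokenize(lines):
        stack, tokens = [0], []
        for line in lines:
            stripped = line.lstrip(' ')
            if not stripped:
                continue                       # skip blank lines
            width = len(line) - len(stripped)
            if width > stack[-1]:
                stack.append(width)
                tokens.append('INDENT')
            while width < stack[-1]:
                stack.pop()
                tokens.append('DEDENT')
            tokens.append(('STMT', stripped))
        while len(stack) > 1:
            stack.pop()
            tokens.append('DEDENT')            # close any open blocks at EOF
        return tokens

    src = ["if x:",
           "    y = 1",
           "    if z:",
           "        y = 2",
           "w = 3"]
    print(tokenize(src))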

Programming languages older than 10 years or so are usually not context-free. For example, C needs context to parse "A*B", because the meaning depends on whether "A" is a type or a variable. Recent programming languages usually try to be LL(1), which is why the keywords "var", "val", and "def" have become so popular.


> The biggest thing that could be improved about parsing IMO is for grammars to be reusable.

This is not solely about parsing. There is the semantic analysis phase afterwards, which is just as hard. In your example "sizeof(struct MyStruct) = 96" must be related to the type definition of MyStruct. Then the compilation target must be asked for the sizes of the primitive types inside the struct. Effectively, the IDE could as well execute the compiler (with a parameter not to generate code) and parse the output.

Maybe an IDE wants a complete frontend as a library. Feed in the file, retrieve the AST. Insert callbacks for warnings and errors instead of printing to stderr.

An interesting problem would be to reparse parts of a file. The parser would have to get an AST node and a file position, to "resume" parsing from there. This only works if there is no context apart from the AST, though.


It's nice to see such optimism, yet I have to (hopefully respectfully) criticize your approach.

> I used to work in the same office as Bob Jervis who wrote Turbo C back in the day, and he said he never uses parser generators. Clearly something is wrong if the tools that have all the weight of the theory and formalisms are losing to coding parsers by hand.

When a number of really smart people do something, there might just be a reason for it. They might be wrong too, but if you're going to call them wrong, perhaps you should take the time to figure out why they do what they do first.

> One thing that should be a red flag is that lots of people still write parsers manually instead of using tools.

Wait! People write programs manually too instead of automatic program constructors. It might be OK.

The thing about "manually" constructing a parser is that it not a terribly difficult task once one has a strong grasp of the relation between languages specification and recursion. Manually constructing a recursive descent parser also doesn't involve throwing out the formal specification of the language. Rather, you transform your language specification until you shape it into a set of productions which can be captured by a series of recursive function calls (essentially taking the language to Greibach Normal Form but there are a number of short cuts you can take once you're comfortable with this process). It is not a ad-hoc affair but it does allow you make add ad-hoc optimizations for the particular actions that are needed or not-needed in the results of your parsing.

The road block to any reusability in parsers is that a language's grammar exists at a different logical level than a single function, object or piece of data.

Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. The simplest thing is to create the code for each of these programs separately and then let the compiler optimize the code, tossing out the things I don't need.

But any "general purpose" parser library more or less has taken the code and fully compile it - turn it into an AST or the equivalent and then let the programmer deal with the results (the way Python or Ruby do reflection). Such an approach work for some things but it inherently isn't going to be very fast because it doesn't which parts of the code the final user is not concerned with.

Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending character is going to beat this "algorithm" a thousand times over. Manual parser generation more or less allows you to boil down your parser to this.

Qt Creator finds syntax errors as I type, and I remember Visual Studio doing the same. I don't think either uses the same parser as their respective compilers, because they have to be more forgiving - otherwise they would fail to arrive at the piece of code you're actually editing - but I could be wrong. I know most syntax highlighting is custom written, but again, it isn't that hard to custom write.


> They might be wrong too but if you're going to call them wrong, perhaps you should first take the time to figure why they do what they do first.

Right back at'cha. You have criticized my approach without understanding it.

> Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. [...] But any "general purpose" parser library more or less has taken the code and fully compile it - turn it into an AST or the equivalent and then let the programmer deal with the results

Actually, that's not what Gazelle does. If that were the best you could do, I'd agree with you.

At its lowest layer, Gazelle offers event-based parsing. It's like SAX, whereas the approach you have described is like DOM.

The way you would remove a single "i" tag from a string is by registering a callback that gets called when "i" is encountered, and another callback that gets called when the corresponding closing tag is encountered.
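For the flavour of the event-based style (using Python's built-in xml.sax here purely as a stand-in; this is not Gazelle's API):

    import xml.sax

    class StripItalics(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.out = []
        def startElement(self, name, attrs):
            if name != 'i':                    # drop the opening <i>
                self.out.append('<%s>' % name)
        def endElement(self, name):
            if name != 'i':                    # drop the matching </i>
                self.out.append('</%s>' % name)
        def characters(self, content):
            self.out.append(content)

    handler = StripItalics()
    xml.sax.parseString(b'<p><i>help me</i> he said</p>', handler)
    print(''.join(handler.out))   # <p>help me he said</p>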

And once I've optimized Gazelle to generate really fast machine code, I doubt your "single loop" is going to beat me. For example, did you know that with SSE you can do 16 byte-wise compares in a single instruction? Are you going to put SSE instructions in your "single loop"?

Finally, your "single loop" is probably not going to correctly handle the case where the "i" tag is inside a CDATA section, and is therefore not actually an "i" tag at all. This is the biggest problem with ad hoc parsers -- nobody seems to realize how incomplete/buggy they are.

> Wait! People write programs manually too instead of automatic program constructors. It might be OK.

I think a better analogy would be: you would know C compilers weren't good enough if people were always writing in assembly instead. Writing a parser manually instead of using an abstraction like a CFG is like writing in assembly language.


> Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending character is going to beat this "algorithm" a thousand times over. Manual parser generation more or less allows you to boil down your parser to this.

However, the parser way might enable your code to handle tricky XML stuff. E.g. namespaces: "<html:i>help me</html:i>".


Either can be useful depending on the situation.

A just-in-time compiler would be able to do the quick loop also if this loop ran somewhat frequently.

The point is each of these approaches makes sense in a given context and so you can't say that any of them are relative "hacks" or "stone-age" or whatever.

I could imagine a future language/system that would allow a few XML-parser commands on a line to be translated into a single loop. But we aren't there yet - the solution isn't just in algorithms but in the whole development process. Haberman's "program" of improving parsing is noble. The problem is he's articulated it as simply bolting algorithms in, and you need more than that (i.e., you need a language that can do lots of compile-time futzing about).


Visual Studio is very often confused at edit-time. The parser should not be more forgiving, it should just remember the parts that once were correct but are now invalidated by further edits. That way, updates that keep local correctness would continue to work even if some parts prevented the correct parsing of the whole file.


> Imagine seeing syntax errors the moment you type them.

Eclipse does that. Or are you talking specifically only about C?


Resharper for Visual Studio (a refactoring tool) uses Antlr to parse the C# code.


There is an off-the-shelf C++ parser. IIRC it costs about $100k per year.


I forgot the name, but this is it:

http://www.edg.com/index.php?location=c_frontend


On a more theoretical point: The source points out that parsing of CFGs is still not solved. But CFGs are themselves pretty narrow types of grammars; fully general parsing is even further away, since we are probably going to need a near-human specialized AI for that.


I could argue that if you are, for any reason, creating a formal computer language which requires more than just a CFG to parse, something has gone wrong somewhere. CFGs are narrow in comparison with general grammars but that doesn't stop them from being (1) exceptionally powerful and (2) exceptionally well-understood.


I'm designing a programming language and I totally embraced the PEG paradigm. I don't care about left recursion. My grammar is simple and clear without it.

A great benefit of PEGs is that I can easily reason about how the parsing will be done as I write the grammar, which will be the same if it's done by the compiler or by the programmer in his head!

Those who say that the ambiguities still exist when using PEGs say so because they still reason in terms of CFGs. There is no ambiguity in PEGs, period. If the language is defined in terms of PEGs, there's no worry about that, as long as the reader of the language spec understands about prioritized choices.


A wonderful, human survey of the field, with pleasantly little jargon. Thanks ltratt!


What mature tooling is around for PEG investigation?


PEG for Lua is pretty easy to use and experiment with: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html


notably: lpeg uses backtracking over packrat parsing


Packrat parsing has fairly horrible memory usage. LPEG is amazingly frugal in terms of code size, run time, and memory usage.

It's worth noting that the PEG ordered choice operator leads to much more restricted backtracking than occurs in most regex implementations, so LPEG has much more predictable performance. And unlike with a DFA matcher, you still get straightforward capture handling.



