Parsing and all that (jeffsmits.net)
63 points by foresterre 9 months ago | 39 comments



Combining a finite state machine for recognizing sentential patterns, with a stack, gives us LR parsing.

But push-down automata are significant not only because they have practical uses in parsing, but because they represent a theoretical class. They are more powerful than finite automata, but are not Turing complete.

If, say, we use a push-down automaton to make a Boolean decision (does this input string fall into the to-be-recognized set, or not?), then there are some kinds of strings it won't be able to decide that a full Turing machine could decide.

The limitation of push-down automata is a direct consequence of the stack discipline. There is only one stack, and when the stack is popped, information is thrown away.

The tape machine that Turing described as part of developing his theory of computation doesn't have that limitation; when the tape head moves backwards, material it has written to the tape is not erased in a stack-like discipline.


Neat observation: finite automata, push-down automata, and Turing machines can be distinguished purely by the number of stacks they need.

Finite automata: no stack.

Pushdown automata: one stack.

Turing machine: two stacks. (Use them like the tape: to move forward, pop the f-stack and push the symbol onto the b-stack; to move backward, pop the b-stack and push onto the f-stack.)
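
A rough sketch of that two-stack tape in Python (the names and details are mine, purely to illustrate the idea):

    # A Turing-machine tape simulated with two stacks: `back` holds the cells
    # behind the head; `front` holds the cell under the head (on top) plus
    # everything ahead of it.
    class TwoStackTape:
        BLANK = "_"

        def __init__(self, symbols):
            self.back = []
            self.front = list(reversed(symbols)) or [self.BLANK]

        def read(self):
            return self.front[-1]                   # cell under the head

        def write(self, symbol):
            self.front[-1] = symbol

        def move_right(self):                       # pop the front stack, push onto back
            self.back.append(self.front.pop())
            if not self.front:
                self.front.append(self.BLANK)       # the tape grows with blanks

        def move_left(self):                        # pop the back stack, push onto front
            self.front.append(self.back.pop() if self.back else self.BLANK)

    tape = TwoStackTape("abc")
    tape.move_right()
    tape.write("X")          # tape now reads a X c, head on X
    tape.move_left()
    assert tape.read() == "a"

Nothing is erased when the head moves back, which is exactly what a single stack cannot give you.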


I think it's a shame this wasn't highlighted more explicitly in, at least, my computer science degree.

We did finite state machines in quite a bit of detail, vaguely referred to pushdown automata, but then concentrated on the lambda calculus for quite a while, with Turing machines covered less.

The Turing machine description, without showing it in this hierarchy, seems like a really weird abstraction. Fitting it in here (and combining it with the different Chomsky classes of grammar) makes it all feel more joined up.


That was the way I learned it. I found the lambda calculus the odd one out. :-)


What else can they do? Has anyone used them outside of parsing, as a way to satisfy the principle of least power?


Michael Jackson's single-threaded "Jackson Structured Programming"?


There is a part of me that feels we should just forget that LR ever existed and insist on LL(k) for k <= 2 for all new languages. The supposed expressivity benefits are not worth the complexity cost, in my not so humble opinion. For me, the best approach is to write the grammar in a DSL that checks that it is unambiguous and in LL(k), and then implement it by hand with recursive descent.


I tend to agree. Having done a bunch of DSLs over the years, and having become extremely fluent with Bison+FLEX, I have to say that a hand-written LL parser with a couple of mutually recursive, table-driven functions to parse expressions via precedence climbing is the way to go. Precedence climbing (or I think it is also called Pratt parsing?) is super simple and flexible. But the BIG bonus is in error reporting, and let's face it, most of the user interaction with your parser is with the error messages; otherwise it is push 'enter' and forget. Getting a reasonable error message out of an LR parser is about as much fun as poking yourself in the eye repeatedly with a sharp stick. With LL, you know what structure you are trying to build... with LR, you have to rummage around in the parse stack looking for obscure clues.
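
The core of precedence climbing really is tiny. A minimal sketch in Python (all names are illustrative, not from any particular parser):

    # Precedence climbing over a pre-tokenized input; operands are single tokens.
    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}        # binding power per operator

    def parse_expr(tokens, pos, min_prec=0):
        lhs, pos = tokens[pos], pos + 1
        while pos < len(tokens) and PREC.get(tokens[pos], -1) >= min_prec:
            op = tokens[pos]
            # Parse the right-hand side with a higher minimum precedence so
            # that "1 + 2 * 3" groups as (1 + (2 * 3)) and "1 - 2 - 3" stays
            # left-associative.
            rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    tree, _ = parse_expr("1 + 2 * 3 - 4".split(), 0)
    print(tree)    # ('-', ('+', '1', ('*', '2', '3')), '4')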


I am aware this is an unpopular opinion, but I think the idea that recursive descent is inherently better suited to error reporting than table-driven LR is massively cargo-culted. You can handcraft error messages for an LR parser if you overspecify the grammar, meaning that you don't just write productions for the actual language; you also write productions for likely errors and have the predicates for those productions report your handcrafted error message.

I'm guessing that a lot of people's first impression of that will be "That's insane, are you seriously suggesting writing two grammars in one?" But that's exactly what you do when you build custom error reporting code into a recursive descent parser. It's just less obvious because the code is procedural instead of declarative.


I see the appeal of adding error nodes, but I'm not a huge fan. If you do this rather than either bombing out immediately on the first error or storing a stack of errors, then the syntax errors pollute the generated tree. This will require you to either add a separate pass for building a new tree that is the same as the old except without the error nodes, or you will have to handle syntax errors downstream in the analysis or code gen phase.

A recursive descent parser does not have to be procedural. I have written one in a functional language that I designed that has no mutable variables or loops (recursive functions handle both of these cases).

Parser generators are great for iterating on language syntax design. Once you know the language syntax, handwriting the parser should be easy. If it isn't, then it is most likely that either your language design or parsing skills need work. Both require time and practice.


> If you do this rather than either bombing out immediately on the first error or storing a stack of errors, then the syntax errors pollute the generated tree. This will require you to either add a separate pass for building a new tree that is the same as the old except without the error nodes, or you will have to handle syntax errors downstream in the analysis or code gen phase.

I don't see how this follows? There is no need to append an error node (or any node at all) to the AST when reducing an error production. You can just print an error message or push one onto a stack of error messages.

> Parser generators are great for iterating on language syntax design. Once you know the language syntax, handwriting the parser should be easy. If it isn't, then it is most likely that either your language design or parsing skills need work. Both require time and practice.

Parser generators aren't just for people who don't know how to write recursive descent parsers; they are production-ready tools. Python, Ruby, PHP, and Bash all use generated parsers.

I think the aversion to parser generators in general and LR parser generators in particular is a combination of not-invented-here-syndrome and cargo-culting the idea that LR parsers can't emit handcrafted error messages.


Yes, it's also not uncommon for lexical grammars to have token classes that are more general than what is actually valid, for example for numerical constants.

What would be useful is tooling that could check that one grammar is a sub-language of another, suitably related grammar (because in the general case that check is of course undecidable).


That's a very interesting way of looking at errors.

Wouldn’t the grammar of errors have to also be LR? Doesn’t that limit the kind of errors you can report?


Yes, it does have to be LR. Which can be a real PITA if you're working with a restrictive subset of LR, like LALR. But there are other options too. LR(1) is pretty powerful, and GLR even more so.


I utterly reject hand-written parsers. You give up the ability to trust your grammar - all too often you can accidentally add an ambiguity.

I like `bison --xml` for the fact that you can write your own code but trust the table.


Do you utterly reject essentially every mainstream language? Which languages do you use that do not use a handwritten parser?


Well, IMO: you write a parser only a couple of times for each language, but you read code all the time.

So, anything that makes code easier to read is a must. That includes the extra expressiveness of an LR grammar.


I agree there are a few places like this, but they're usually solved when you want an LL parser for as much as possible of the language by either making them part of the tokenization or by "escaping" a small subset of the grammar.

I wish that language designers (looking at you, Ruby - as much as I love using Ruby, the grammar is an abomination) would at least try to stick to LL as much as possible, and consciously, carefully and in a limited fashion "escape" into LR for as small subsets as possible.

Break rules if it actually helps developers, sure, but it'd be nice if people tried to make the weird bits as tiny as possible...

EDIT: Also, in practice this is often what we end up doing when we write recursive descent anyway, only in a less than formal way. It'd be nice if more tooling let us write grammars that default to LL w/checking, but let you insert clearly delineated exceptions.


Oh, I do agree with that!

A language should be LL as much as possible. And also depend on the parsed data as little as possible. Stuff like C's variable declaration must be avoided.

But I don't think either of those rules can be made universal. Some language for expressing exceptions would be great.


Do you know of any particularly readable constructs that are possible with LR but incompatible with LL?

My general opinion is that any constructs that confuse a parser will also confuse a human.


Operators with precedence rules.

And given the modern idea of "everything is either an identifier or an operator", precedence became very important.


Operators with precedence rules frequently appear in top-down parsers. I agree that pure-LL form doesn't make this easy, but most languages I have seen use Pratt parsers that allow precedence parsing within a top-down framework. It doesn't require full LR.


Pratt parsing is essentially hand-written LR. It's certainly bottom-up in the same way, despite Pratt's paper title.

(I went into more depth on this a few years ago: https://www.abubalay.com/blog/2021/12/31/lr-control-flow)


Hm I wouldn't call it bottom up -- I would say it's top down, but you consult a table of integers to decide where to recurse.

I would even call it a variant of recursive descent, which is top-down. In plain recursive descent, you often choose where to recurse based on the first token (e.g. if, while, for). In Pratt, you choose based on a table indexed by the second token, the operator.

You can have an entire arbitrary-length subexpression before the "second token", but it is still resolved "eagerly".

e.g. consulting the loop here - https://github.com/andychu/pratt-parsing-demo/blob/master/td...

I skimmed over your post, but I don't really see the bottom-up argument.

I guess I could concede that it's neither top down nor bottom up, but I don't see the argument for being bottom up.

And maybe this is just a taxonomy thing, which doesn't matter in practice. It could be understood both ways -- I don't think Vaughan Pratt was mistaken when he understood it a certain way :)

---

"where to recurse" is basically choosing the parent's production before its children -- that's most similar to top down

If you choose the child's production before its parent, then that's LR / bottom-up

(using Haberman's explanation - https://blog.reverberate.org/2013/07/ll-and-lr-parsing-demys...)

May also be relevant:

https://www.oilshell.org/blog/2017/03/31.html

https://www.oilshell.org/blog/2017/04/22.html

https://matklad.github.io/2020/04/15/from-pratt-to-dijkstra....

I believe some people claimed that Dijkstra is bottom-up and Pratt is top-down. That has a certain nice symmetry to it, but I'm not sure it's true :) Dijkstra does use an explicit stack and Pratt uses the call stack -- maybe that is how I'd put it.


The fact that you parse an entire subtree before you know the kind of its parent is exactly what makes it bottom-up. Pratt is certainly easy to integrate into an otherwise-top-down parser, because you can enter it from the top and it can enter its "leaves" from the top, but internally it is absolutely bottom-up, and you can generalize this to any recursive ascent parser.

To make this more explicit:

The first thing a Pratt parser does before entering the loop is based on the lookahead token's "nud." This corresponds to an LR automaton in its initial state shifting that token and advancing into the leaf-most rule(s). The "nud" typically switches back to top-down internally, where pure bottom-up would not, but the choice of "nud" is necessarily made in a bottom-up way.

Once "nud" returns, or the LR automaton reduces back to the initial state, the next thing it does is based on the operator token's "led." This corresponds to the LR automaton shifting that token and advancing into the parent rule determined by the operator. This resolution is just as "eager" in LR, where it is represented by the new state containing only items from the chosen rule. This is the hallmark of bottom-up parsing: as you progress through the children, you narrow down the possibilities for the parent.

Finally, note that none of this has anything to do with the use of function calls vs a table of integers and explicit stack. Both top-down and bottom-up parsers can be implemented using either recursion or a table and stack.
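
To make the nud/led shape concrete, a tiny sketch in Python (illustrative names only, not taken from Pratt's paper or any particular parser):

    # nud: how a token begins a subexpression.  led: how an operator token
    # extends an already-parsed left subtree.  Picking a led only after the
    # left subtree is finished is the bottom-up step described above.
    def pratt_parse(tokens):
        pos = 0
        LBP = {"+": 10, "-": 10, "*": 20}          # left binding power

        def advance():
            nonlocal pos
            tok = tokens[pos]
            pos += 1
            return tok

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def nud(tok):                              # token starts a subexpression
            if tok == "-":
                return ("neg", parse(30))          # prefix minus binds tightly
            return ("num", tok)

        def led(tok, left):                        # token extends `left`
            return (tok, left, parse(LBP[tok]))    # left-associative infix

        def parse(min_bp):
            left = nud(advance())
            while peek() in LBP and LBP[peek()] > min_bp:
                left = led(advance(), left)
            return left

        return parse(0)

    print(pratt_parse("1 + - 2 * 3".split()))
    # ('+', ('num', '1'), ('*', ('neg', ('num', '2')), ('num', '3')))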


OK interesting, yeah I do see that if you consider how these expressions are parsed

    1 + 1 * 42
    1 + 1 + 1 * 42
    1 + 1 + 1 + ... * 42
Or simply right associative operators.

I'm not sure how much the classification matters, similar to how I've concluded that Pratt parsing vs. shunting yard doesn't seem to "matter" (both work).

But I have heard the "complaint" that Pratt parsing is hard to understand, so maybe explaining it as bottom-up could help.

The arbitrary-length subexpression prefix is really the problem that LR solves compared to LL (historically Python used LL parsing, and param=value and lvalue=rvalue are "overparsed" for this reason).


Yeah- I tend to think of the whole vague collection of LR + shunting yard + Pratt as one single approach in my head, with the same capabilities in terms of parsing.

Sometimes working through a particular problem, or conflict, or example is easier from one than the others, but knowing how they correspond (which is similar to how iteration and recursion correspond) means you can translate those ideas across.


In 2012 yes, but these days having grammars that lots of tools can parse and output easily is a boon for tooling.


I've always been hesitant about LL(1) since it requires a hack for something as common as expression-parsing, whereas LR is fine. Note that the linked article is misleading about LR having a problem with right-recursion: it only affects (data) stack depth, rather than breaking the machine like the LL problem. Still, LL and LR have common ground against the heresy of backtracking at least.

I've never met a sane grammar that's not LALR(1). It's unfortunate that LALR(1) is such a leap of complexity beyond SLR(1), which is not practical for real languages.

I imagine that LL(1) "mostly" works after the expression hack - what's your particular motivation for LL(2)? A lot of weird grammars lead to code that's hard to parse for humans.

(Note also that, for languages (Python, Javascript, etc.) that lack an efficient `switch` statement, table-based is likely to beat function-call-based code)


Needing two tokens of lookahead is fairly common. For example, in Lua "local IDENTIFIER" is a local variable declaration and "local function" is a private function declaration.
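
A toy sketch of that two-token decision in Python (the names are illustrative, not Lua's actual grammar productions):

    # One token of lookahead ("local") is not enough; the second token decides
    # which production to enter.
    def classify_local(tokens, pos):
        assert tokens[pos] == "local"
        if tokens[pos + 1] == "function":
            return "local function definition"
        return "local variable declaration"

    print(classify_local(["local", "function", "f"], 0))   # local function definition
    print(classify_local(["local", "x", "="], 0))          # local variable declaration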


How relevant are context-free grammars for modern programming languages? It seems that most modern languages want things like “raw strings” where you can deal with arbitrary strings by using syntax like (Rust) `r###` (however many `#` you need). It is my understanding that syntax like that makes these grammars fall outside of context-free, no?


That sort of thing is typically handled during lexing. By the time the parser deals with it, it's a string literal. This is one of the advantages of splitting those into separate phases.

There are other hacks in use as well. Haskell's syntax is context-sensitive because operator precedence and associativity can be declared in the same context as the operator is used. But GHC ignores that during the phase where it actually parses the input. It just ignores precedence and associativity when parsing operators. After it finishes parsing the translation unit, another phase goes back, finds nested operator applications in the parse tree, and fixes them up based on the precedence and associativity information it has found.

It turns out that when your language is only just barely context-sensitive, you can find a way to use context-free tools to handle it, with a couple extra passes to fix things up.


Features like that technically make the language context-sensitive, but in practice that context sensitivity is entirely contained within the lexer, and you can still describe the rest of the language with a context free grammar of tokens.
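
For example, a Rust-style raw string can be lexed by counting the opening '#'s and looking for a matching closer; a toy sketch (this is not Rust's actual lexer, just the idea):

    # The "context" (how many #s opened the literal) never leaves the lexer,
    # so the parser only ever sees an ordinary string-literal token.
    def lex_raw_string(src, pos):
        assert src[pos] == "r"
        pos += 1
        hashes = 0
        while src[pos] == "#":
            hashes += 1
            pos += 1
        assert src[pos] == '"'
        pos += 1
        closer = '"' + "#" * hashes
        end = src.index(closer, pos)           # ValueError if unterminated
        return src[pos:end], end + len(closer)

    text, next_pos = lex_raw_string('r##"a "quoted" word"## + 1', 0)
    print(text)        # a "quoted" word
    print(next_pos)    # 22, just past the closing "##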


Okay, thanks. I think that’s good.


The actual place where CFGs fail tend to be if types and expressions aren't parsed the same. This especially causes problems if your language's generics use `<>`, but even C has the `T * v` problem.


In traditional programming languages the origin of such problems is the straightjacket of the ASCII character set.

When a programming language is designed now, the right way is to use distinct Unicode characters wherever necessary, easily avoiding all ambiguities.


I would like to agree, but since keyboards are still mostly ASCII (in particular with regard to punctuation characters), most people don’t have a Compose key configured, and also don’t want to be constrained to editors with an input method tailored to that programming language, it would be an uphill battle for adoption.


Interestingly though even C had the “not on my keyboard” problem, hence the trigraphs. And still they ended up having to use “lexer hacks”.

Modern languages could use a more legible solution (unlike C's trigraphs) as write-only alternatives (a canonical formatter could normalize). Since modern languages can now use the whole (woah) printable ASCII range. :)


The trigraphs feature was a committee invention for something that had already largely stopped being a problem by the time it was decided. Nobody really used them, other than by mistake, which is why they have now been removed in C23 (and already earlier in C++).

A write-only solution is virtually the same as an input method, or you have to support different representations in source code after all. It also doesn't fix all syntax ambiguities. For example, you may want to allow “<<” as an ASCII version of “«”, but then you still have possible ambiguities with consecutive “<”s (does “<<< … >>>” mean “<« … »>” or “«< … >»”?). And it has the drawback that you have to learn how to type the symbols that look different from their ASCII sequence counterpart in the canonical version.



