LR Parsing: More Elegant Than You Think

dbcurtis · on June 4, 2018

1. Web site is just a blank white page on mobile Safari.

2. LR parsing is elegant in the mathematical sense of the word, but use in production leads to regret. The problem with bottom-up parsers of any flavor, LR included, is that when you encounter a syntax error there is no context information, or context is difficult to infer, which makes creating a sensible error message less fun than poking yourself in the eye repeatedly with a sharp stick. I contend that giving good error messages is the most important factor in user-friendliness of any parser.

Look up "precedence climbing" -- a recursive descent parser with expressions handled by precedence climbing is going be an easier path to a production parser with good diagnostics, IMHO.

cwzwarich · on June 4, 2018

> The problem with bottom-up parsers of any flavor, LR included, is that when you encounter a syntax error there is no context information, or context is difficult to infer, which makes creating a sensible error message less fun than poking yourself in the eye repeatedly with a sharp stick.

Both LL and LR parsers share a fundamental flaw for error reporting; they are both prefix parsers. They will parse the input until it fails to be a valid prefix of the language represented by the grammar. However, it is rare that this coincides with the place where the actual error was made.

The difference in information available here at the end of the valid prefix is less about the difference between LL and LR parsing and more about the difference between LL and LR grammars. An LR parser parsing an LL grammar is able to recover the same information as an LL parser at the point of error. And for non-LL grammars, it is easy to extend an LR parser with additional context information by splitting states (see http://gallium.inria.fr/~fpottier/publis/fpottier-reachabili...).

Parsing algorithms that do not produce prefix parsers (e.g. precedence parsing) are often able to produce more logical error messages with less work, because they continue making actions after the input is no longer a valid prefix, and this can amount to gathering more information about the error. For the same reason, LR-family parsers that perform default reductions (e.g. LALR parsers, or any minimal state LR parser) often perform reductions (but not shifts) after the point of error, and produce better error messages than canonical LR parsers.

My favorite approach to error recovery is Richter's approach of error intervals (https://dl.acm.org/citation.cfm?id=4019), which uses alternating prefix/suffix parsers to find minimal error regions in the input that can not occur as a substring of any valid input. This has no dependence on the grammar or parsing technique. It was not (widely?) known at the time that Richter wrote his paper, but suffix parsing of LR grammars is linear time.

dataflow · on June 4, 2018

I'm unfamiliar with precedence parsing but what attracted me to (G)LR was its running time. What is the running time guarantee of a precedence parser for LR(k) and general (including potentially ambiguous) context-free grammars?

cwzwarich · on June 4, 2018

Operator precedence parsing is linear time, but it only produces a parser for a very limited set of grammars, the epsilon-free LR(1) grammars with no consecutive nonterminals.

Precedence parsing is typically combined with another parsing method to parse a non-operator part of a grammar, or it is used with preprocessing filters that insert extra marker tokens to work around the limitations.

Most programming languages have grammars that are very close to being operator precedence languages, so in the early days of parsing there were a lot of attempts to extend precedence parsing to handle more general grammars. After SLR/LALR parsing was developed, most of this work stopped.

dataflow · on June 4, 2018

Thanks! That sounds awfully limiting if I'm being honest, even if a language is 'mostly' well-behaved... I can't say I can see how you could use preprocessing to get around the limitations since it would seem that doing so properly would require already parsing the input. It sounds like a nice method for when it works but it doesn't really sound like it would substitute for LR?

oso2k · on June 6, 2018

Not really limiting. Crockford used it in jslint [0][1][2] to parse JavaScript, and OIL Shell also uses it [3] to parse a shell language.

[0] https://youtu.be/Nlqv6NtBXcA [1] https://crockford.com/javascript/tdop/tdop.html [2] https://en.wikipedia.org/wiki/Operator-precedence_parser [3] https://www.oilshell.org/blog/2017/03/31.html

dbcurtis · on June 4, 2018

Interesting. I will have to check out the link.

userbinator · on June 5, 2018

1. Web site is just a blank white page on mobile Safari.

Looking at the source, it's a bunch of JavaScript with what looks like the content in a very mildly obfuscated form, e.g.

    ["non-optimal"],["al","go","rithm"],["that"],["you"],["un","

If your goal was to try to prevent dumb scraping, you've succeeded; but for accessibility and general efficiency, it is a failure.

...unless of course the whole page is a demo of using an LR parser to consume that oddly formatted content and output HTML, in which case I apologise. But a <noscript> with an explanation would be better than a blank page.

dlisboa · on June 5, 2018

My guess is he's probably using a JavaScript library to provide text justification. It has some pretty great typography for a web page, hyphenation on the margin and all that. Fancy stuff. Some parts are way better than what CSS gives you.

Sucks that it doesn't show anything, but it's a bit ridiculous that 40 years after TeX we don't have proper typography on browsers and people have resort to stuff like this simply to justify text. Not that I agree with the choice, but I understand it.

repsilat · on June 5, 2018

> expressions handled by precedence climbing

I fixed up operator precedence in a hobby project a few weeks ago, and found a really fun way to do it that turned out to require almost no code (but half a page of comments :-).

Basically I just parsed left to right, ignoring precedence (making everything right-associative and equal precedence) and then fixed up the precedence in a simple term rewriting pass after the fact.

The worst-case asymptotic complexity isn't super (quadratic for chains of multiplications, I think) but the code was so satisfying to write and the parser code itself is unbelievably simple.

Oh man I love to abuse term rewriting, I get so much mileage out that one little function...

himom · on June 4, 2018

Exactly. This is one of the many reasons why most parsers are written top-down. Bison, which is a bottom-up GLR parser generator, does token location tracking, but it doesn’t help with annotating code properly for human-readable warnings and errors. YACC was the UNIX inspiration for Bison, but it was LALR(1).

OTOH, top-down parsers LL(0), LL(1), LL(k) are far easier to maintain because there’s a straightforward correspondence of BNF to code. It’s not necessarily the fanciest

fao_ · on June 4, 2018

By precedence climbing are you referring to TDOP (Top-Down Operator Precedence)?

dbcurtis · on June 4, 2018

No, let me copy and paste the first hit from Google:

https://eli.thegreenplace.net/2012/08/02/parsing-expressions...

taftster · on June 4, 2018

From your link:

> Update (2016-11-02): Andy Chu notes that precedence climbing and TDOP are pretty much the same algorithm, formulated a bit differently. I tend to agree, and also note that Shunting Yard is again the same algorithm, except that the explicit recursion is replaced by a stack.

So, let's be careful about pasting a "first hit from Google" that you might not be discussing close to the same thing.

jhpriestley · on June 4, 2018

Sorry about the website, I didn't think I was using anything unsupported by the major browsers. I'll add a transpiler soon and hopefully that will fix things for you.

metafunctor · on June 5, 2018

Besides the site not rendering at all on some (important!) browsers, cut'n'pasting text also doesn't work: spaces are not included.

Are you sure you couldn't live with "text-align: justify; text-justify: inter-word;" and lang="en"? If you can't, maybe just make separate PDF downloads of your articles, if you want to control the typography in this kind of detail.

klmr · on June 5, 2018

And if OP is interested in improving the typography, a trivial, side-effect free improvement that’s currently missing would be to replace dashes and quotation marks by their “proper” glyphs, and using non-breaking whitespace where appropriate.

uwu · on June 4, 2018

the font loader script is causing it to never load for me

i run latest chrome with the --disable-remote-fonts command line option

jhpriestley · on June 5, 2018

I've added a timeout on the font loader, you should make it through after a few seconds, although I don't know how good the page will look without the custom fonts.

I've also added a transpiler.

andreareina · on June 5, 2018

The layout engine is making the lines too long, which combined with hard line breaks means that this is what I get:

    ************************
    ***
    ************************
    *****
    ************************
    **
    ************************
    ****

klmr · on June 5, 2018

Honestly, the custom layouting on this website is cute but I wish the author would ship proper HTML code instead and accept the slightly (if that!) inferior layouting of the browser engine.

At the moment the layouting als breaks the expected behaviour when selecting the text (double-click usually selects a word, but on this website it selects a seemingly arbitrary sentence fragment). But this doesn’t matter because copy&paste is broken anway, since all spaces are removed (but random hyphens are instead included. uhm.)

Furthermore, the layout engine seems to attempt some microtypography corrections implemented by pdfTeX but does so relatively badly (dashes protrude too far, other punctuation doesn’t protrude at all; punctuation is broken incorrectly across lines; ). In sum, I’m not convinced this layout is actually even better than the browser’s, even ignoring the bugs.

yuchi · on June 4, 2018

The code snippets use JS in a very non idiomatic way — which makes this post interesting IMHumbleO

There are few mutations that could be removed to have a clearer/purer functional approach, BTW

WallWextra · on June 4, 2018

I recently was looking at the original LR paper by Knuth, and for the first time noticed that it said "Case Institute of Technology" at the top.

i.e. he discovered it as an undergrad.

pmiller2 · on June 5, 2018

To be fair, there was a lot more not terribly complex yet still foundational stuff to be discovered then.

isaachier · on June 4, 2018

What language is being used in the snippets?

jhpriestley · on June 4, 2018

Javascript

isaachier · on June 4, 2018

Lol I thought it might be. The dialect has changed a lot since I last dabbled in JS.

ninjakeyboard · on June 4, 2018

It's come a very long way.

rimliu · on June 5, 2018

Not really.