Operator precedence by textual substitution (kmjn.org)
90 points by zdw on May 16, 2022 | hide | past | favorite | 26 comments



This is neat! Some of the alternative solutions discussed here seem to confuse compilation with evaluation. Fortran is trying to rewrite the computation in such a way that, later on, a machine that knows nothing about precedence can evaluate the expression. It's incredible that such a rewrite can be performed from left to right without recursion and while using only O(1) memory at a time.
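The single-pass rewrite can be sketched in a few lines (a toy illustration of the technique the article describes, not the actual Fortran I code): each operator is replaced by a fixed string of parentheses plus the operator, and the whole expression gets wrapped, so precedence falls out of nesting depth alone.

```python
# Classic Fortran-style trick: one left-to-right pass of textual
# substitutions turns operator precedence into parenthesization.
SUBST = {
    '+': ')))+(((', '-': ')))-(((',   # low precedence: close/reopen 3 levels
    '*': '))*((',   '/': '))/((',     # high precedence: close/reopen 2 levels
    '(': '((((',    ')': '))))',      # user parens get extra depth
}

def fully_parenthesize(expr):
    out = '((('
    for ch in expr:
        out += SUBST.get(ch, ch)      # non-operator characters pass through
    return out + ')))'

print(fully_parenthesize('a+b*c'))    # (((a)))+(((b))*((c)))
```

The output is fully parenthesized, so a later evaluator never needs to know that `*` binds tighter than `+`.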


For more parsing history, see Jeffrey Kegler's "Parsing: a timeline": https://jeffreykegler.github.io/personal/timeline_v3

For a more elegant solution, see these tiny languages parsed using Floyd's operator precedence parsing:

http://breuleux.net/blog/language-howto.html

https://news.ycombinator.com/item?id=8554084


The Wikipedia article about Operator Precedence Parser [1] is very informative. It describes this technique as well as other ones such as Pratt parser and precedence climbing.

[1] https://en.wikipedia.org/wiki/Operator-precedence_parser


I would suggest against relying too closely on this article in its current state. The "precedence climbing" code is needlessly complicated -- it has a nested loop that does nothing useful; only one loop is necessary -- and it doesn't really explain what's going on (which is a relatively simple idea), it just writes some pseudocode and mechanically evaluates it.


Pretty neat! Takes advantage of the fact that the parser needs to use recursion or a stack to handle parentheses. If it can handle parentheses, it has all the tools necessary to handle operator precedence. If we think of how a recursive descent parser would implement operator precedence, by means of one level of recursion per precedence level, this trick is basically encoding that in terms of parentheses.


The problem where it messes up precedence if you already use parentheses, so that you need to add more parentheses to the substitutions, reminds me of the problem with one common explanation of CSS specificity. Simplifying the model a bit, specificity is a tuple of (id, class, tag), so that `#foo #bar` is (2, 0, 0) and greater than `#foo .baz` which is (1, 1, 0). Well, some tried saying “treat it as a number”, leading to 200 and 110, and you can clearly see which number is bigger. The trouble with that approach is the numerical base: no number of classes will outweigh an ID, so (1, 11, 0) is not greater than (2, 0, 0), but if you want to represent it as a three-digit number, well, you’re stuck. Just like you can here add more parentheses to the substitution to keep it working, you could increase the numeric base; e.g. at this point hexadecimal would still work, 0x1b0 < 0x200.
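The difference between the two models is easy to demonstrate (a tiny illustration, not real CSS machinery): lexicographic comparison of tuples never overflows, while concatenating the counts into one base-10 number does.

```python
# Specificity of `#foo #bar`: two IDs.
a = (2, 0, 0)
# One ID plus eleven classes.
b = (1, 11, 0)

# Lexicographic tuple comparison gets it right:
# an ID always outweighs any number of classes.
print(a > b)          # True

# The "treat it as a number" model: 200 vs... the 11 classes
# carry over into the ID digit, so b wrongly looks bigger.
print(200 > 1110)     # False
```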

Not sure exactly what to call this category of limitation, but you see it in diverse places where people have designed things without considering the upper limits of what they can cope with.


A common issue is trying to store a variable sized value in fixed size storage, with no test that the value fits!


Implementing operator precedence isn't that hard actually.

Essentially you take note of the first operator's precedence and keep consuming input until you reach an operator with lower precedence than you currently have. The precedence you use for comparison is updated each step.

Here's some code that does that: https://github.com/uibk-dps-teaching/ps_cc_2021/blob/0ad022c...

Only thing you have to watch here: with a recursive descent parser you might need an extra step to fix the associativity (right-to-left vs. left-to-right) if that's important for your application.

Here's that: https://github.com/uibk-dps-teaching/ps_cc_2021/blob/0ad022c...
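The loop described above can be sketched as a small evaluator (my own minimal precedence-climbing sketch, not taken from the linked repository; the token handling and PREC table are illustrative):

```python
# Minimal precedence climbing: remember the current precedence floor,
# keep consuming operators at or above it, raise the floor for the
# right-hand side to get left associativity.
import re

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def tokenize(s):
    return re.findall(r'\d+|[-+*/()]', s)

def parse_expr(tokens, min_prec=1):
    lhs = parse_atom(tokens)
    # Consume operators until one binds weaker than the current floor.
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand must bind strictly tighter.
        rhs = parse_expr(tokens, PREC[op] + 1)
        lhs = {'+': lhs + rhs, '-': lhs - rhs,
               '*': lhs * rhs, '/': lhs / rhs}[op]
    return lhs

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == '(':
        val = parse_expr(tokens)
        tokens.pop(0)  # consume ')'
        return val
    return int(tok)

print(parse_expr(tokenize('1+2*3')))   # 7
print(parse_expr(tokenize('2*3+4')))   # 10
```

Passing `PREC[op]` instead of `PREC[op] + 1` for the recursive call is the associativity knob mentioned above: it flips the operator to right-associative.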


It's a problem that's been around for quite a while. Like, since the dawn of computers. And it's been solved numerous times in compiler generators, explained in textbooks, etc. Right now most solutions compete on clarity, simplicity and adaptability, not solving the problem itself.

These days the general consensus is to use the Pratt parser, which is somewhat harder to understand, but ends up not too hard to build and integrate into a recursive descent parser, and is super easy to modify.

Here's one popular explanation:

https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-...

What I find funny is that Pratt's parser was never really a theoretical result and was never properly covered in popular compiler textbooks.


> What I find funny is that Pratt's parser was never really a theoretical result and was never properly covered in popular compiler textbooks.

The theory behind precedence parsing was studied a decade earlier by Floyd (who is cited in Pratt's paper). Pratt's own paper is more about the justification of considering operators to have semantic importance rather than productions of nonterminals.


Yup.

And this idea makes all the difference for language implementers.


One recent book that covers it is Crafting Interpreters, by Bob Nystrom. Maybe if we give it some more time it will start counting as a popular compiler textbook ;)

https://craftinginterpreters.com/


> What I find funny is that Pratt's parser was never really a theoretical result

Sorry, what?


What I wanted to say is that it's a popular practical algorithm. But theoretically speaking it's a variant of the shunting yard algorithm. Books typically describe the latter.


It's only not hard if you have a parser that allows you to carry that kind of context. Admittedly, operator precedence is so important that most parsers (and parser generators) have some canonical way to handle it. But try to implement a language with user-defined binary operators and consider the mess on your hands.


Pratt parsing (as described in https://matklad.github.io//2020/04/13/simple-but-powerful-pr...), handles operator precedence and different operator types (prefix, infix, postfix and mixfix) in a very neat manner, and it's easy enough to stuff user-defined operators in there.
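In the style of the linked article, the core of a Pratt parser is just a loop over binding powers (a toy sketch; the binding-power values and tuple-based AST are illustrative, not from the article):

```python
# Toy Pratt parser: each infix operator carries a (left, right) binding
# power pair; right > left gives left associativity. Prefix operators
# get their own binding power.
import re

INFIX_BP = {'+': (1, 2), '-': (1, 2), '*': (3, 4), '/': (3, 4)}
PREFIX_BP = {'-': 5}

def tokenize(s):
    return re.findall(r'\d+|\S', s)

def expr(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok in PREFIX_BP:              # prefix operator, e.g. unary minus
        lhs = ('-', expr(tokens, PREFIX_BP[tok]))
    elif tok == '(':
        lhs = expr(tokens)
        tokens.pop(0)                 # consume ')'
    else:
        lhs = int(tok)
    while tokens and tokens[0] in INFIX_BP:
        l_bp, r_bp = INFIX_BP[tokens[0]]
        if l_bp < min_bp:             # operator binds too weakly; stop
            break
        op = tokens.pop(0)
        lhs = (op, lhs, expr(tokens, r_bp))
    return lhs

print(expr(tokenize('-1 + 2 * 3')))   # ('+', ('-', 1), ('*', 2, 3))
```

User-defined operators only need a new entry in `INFIX_BP` (or `PREFIX_BP`), which is why this style stays easy to extend.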


What complexity?! The precedence is a simple integer value. Even when you include associativity, there are only two options: right-to-left, left-to-right.

Doesn't matter whether that information is language defined or user defined.

For instance in Haskell, that's all the information you provide. You'd use either `infixl` or `infixr` (depending on which associativity you want) followed by an integer defining the precedence for the given operator.

The parser just needs a table to look up the precedence (and associativity) for a given operator. But a parser already needs a bunch of different tables in order to work, so that's certainly not a problem.
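That table can be sketched in a couple of lines (a hypothetical illustration of the idea, not Haskell's actual implementation): user fixity declarations just add rows at parse time.

```python
# Fixity table: operator -> (precedence, associativity).
# Built-in entries plus whatever fixity declarations the parser has seen.
FIXITY = {
    '+': (6, 'left'),
    '*': (7, 'left'),
}

def declare(op, prec, assoc):
    """Roughly what processing a declaration like Haskell's
    `infixl 5 <+>` would add to the table."""
    FIXITY[op] = (prec, assoc)

declare('<+>', 5, 'left')      # user-defined operator
print(FIXITY['<+>'])           # (5, 'left')
```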


> The parser just needs a table to look up the precedence (and associativity) for a given operator. But a parser already needs a bunch of different tables in order to work, so that's certainly not a problem.

Think about how you'd implement an independent parser for such a language, say for the sake of pretty printing it to html. Module A uses a user-defined operator that has been defined in Module B. You effectively cannot parse A correctly before you parse and load B, but the fact that you need to parse and load B is "hidden" in some import declaration that in turn needs to be parsed first. That's what I call a mess.


What alternatives do you believe offer easier support for user-defined binary operators? Almost all parsing technologies are more rigid and more complicated than numerical precedence.


I have no idea, really. If I did, I'd try to submit a paper to POPL or so.


I also find operator precedence quite easy to implement.

In my parser, when it encounters two binary expressions `a * b + c`, it always initially parses them as `a * (b + c)` into temporary AST nodes. Then it compares the precedence of the operators and, if needed, rewrites the nodes to form the correct expression.
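The rewrite step described here amounts to a tree rotation (my own sketch of the idea, assuming a simple tuple-of-precedences table and a dataclass AST, not the commenter's actual parser):

```python
# Parse naively right-leaning, then rotate nodes until precedence
# (and left associativity) are respected.
from dataclasses import dataclass

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

@dataclass
class Bin:
    op: str
    left: object
    right: object

def fix(node):
    """Rotate `a OP1 (b OP2 c)` into `(a OP1 b) OP2 c` whenever
    OP1 binds at least as tightly as OP2."""
    if not isinstance(node, Bin):
        return node
    node.right = fix(node.right)
    r = node.right
    if isinstance(r, Bin) and PREC[node.op] >= PREC[r.op]:
        return Bin(r.op, fix(Bin(node.op, node.left, r.left)), r.right)
    return node

# Naive right-leaning parse of `a * b + c` ...
tree = Bin('*', 'a', Bin('+', 'b', 'c'))
# ... rotated into `(a * b) + c`.
print(fix(tree))
```

Using `>=` rather than `>` in the comparison is what makes equal-precedence chains like `a - b - c` come out left-associative.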


> Implementing operator precedence isn't that hard actually.

Like all claims about making languages: true until it turns out to be a big fat lie.

Implementing full precedence for languages more complex than formula evaluators can become harder than expected (you think you've nailed it, then an edge case happens).


does it work for unary +/-?


Almost: +13 would be rewritten to ((()))+(((13))) and -13 to ((()))-(((13))). If you interpret the empty expression as 0, perhaps it would work out.

Good question!


There are other unary operators (!, ~), though I'm almost sure FORTRAN didn't support those. But 1 - -13 => 1 - 0 - 13 = -12 would fail, wouldn't it?


No, not in Fortran of that era; ~ was pretty much missing from most card punch keyboards.



