A Handwritten Math Parser in 100 lines of Python (github.com/gnebehay)
64 points by gnebehay on Sept 17, 2020 | 38 comments



Slightly more amusingly, but non-equivalently:

  import sys
  from functools import reduce
  from operator import add, sub, mul, truediv as div

  def calc(s, ops=[(add, '+'), (sub, '-'), (mul, '*'), (div, '/')]):
    # No operators left to split on: s must be a bare number.
    if not ops:
      return float(s)

    # Split on the loosest-binding operator first; each fragment is
    # evaluated with the tighter-binding operators that remain, and the
    # left fold via reduce gives left associativity.
    (func, token), *remaining_ops = ops
    return reduce(func, (calc(part, remaining_ops) for part in s.split(token)))

  if __name__ == '__main__':
    print(calc(''.join(sys.argv[1:])))

It does support order of operations, in spite of appearances. Extending it to support parentheses is left as an exercise for the reader :)


It's pretty horrible, but this adds support for parentheses inside the `full_eval` function:

  def split_paren(expr):
      # Split around the innermost parenthesised group: the first ')'
      # and the nearest '(' before it, so sibling groups such as
      # "(1+2)*(3+4)" are reduced one at a time.
      close = expr.index(')')
      open_ = expr.rindex('(', 0, close)
      return expr[:open_], expr[open_ + 1:close], expr[close + 1:]

  def full_eval(expr):
      if '(' in expr:
          left, inside, right = split_paren(expr)
          # The innermost group contains no parentheses, so calc can
          # evaluate it; splice the result back in and repeat. (Negative
          # intermediate results will still confuse calc's splitting.)
          return full_eval(left + str(calc(inside)) + right)
      else:
          return calc(expr)


Neat! Well, sort of. Sometimes, certain things would look nicer not written in Python...


This is pretty clever. There's a video by Jonathan Blow, creator of Braid and currently making a programming language called Jai, that talks about parsing expressions with order of operations. He gives a simple approach that apparently was discovered recently (though he didn't remember the paper's name), does this in a single pass, and is hardly any more complicated than what you wrote. You can see it here [1]. He also trashes yacc/bison before that, along with most academic parsing theory.

[1] https://www.youtube.com/watch?v=MnctEW1oL-E&t=4578s


Reverse Polish notation (RPN) > Converting from infix notation: https://en.wikipedia.org/wiki/Reverse_Polish_notation#Conver...

Shunting-yard algorithm: https://en.wikipedia.org/wiki/Shunting-yard_algorithm

Infix notation supports parentheses.

Infix notation: 3 + 4 × (2 − 1)

RPN: 3 4 2 1 − × +
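
To make that concrete, here is a minimal shunting-yard sketch in Python (my code, assuming single-character tokens and only left-associative binary operators) that produces the RPN above:

  def to_rpn(tokens):
      prec = {'+': 1, '-': 1, '*': 2, '/': 2}  # higher binds tighter
      output, stack = [], []
      for tok in tokens:
          if tok in prec:
              # Left-associative: pop operators of equal or greater
              # precedence before pushing the new one.
              while stack and stack[-1] in prec and prec[stack[-1]] >= prec[tok]:
                  output.append(stack.pop())
              stack.append(tok)
          elif tok == '(':
              stack.append(tok)
          elif tok == ')':
              while stack[-1] != '(':
                  output.append(stack.pop())
              stack.pop()  # discard the '('
          else:
              output.append(tok)  # operand
      while stack:
          output.append(stack.pop())
      return output

  >>> to_rpn(list('3+4*(2-1)'))
  ['3', '4', '2', '1', '-', '*', '+']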


I would like to know the name of that paper if anyone knows.


I believe he was talking about Pratt parsing in the video.

Top Down Operator Precedence - Vaughan R. Pratt - https://tdop.github.io/

What he called "priority" is called "binding power" in the paper.


I like Jonathan Blow and all, but he should be careful with trashing academic parsing theory if he thinks that Pratt parsing was only discovered "in the last few years".


It sounds a lot like Pratt parsing, or at least the way I usually implement it. But I don't know what paper he is referring to.


I think you have expanded my python horizons, thank you!


> Luckily, one can make the grammar LL(1)-parser-friendly by rewriting all the left recursions in the grammar rules as right recursions.

The problem with this is that it changes all the left-associative operators to be right-associative. For example, `python compute.py '3-2-1'` gives the result 2, but it should be 0.
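
The usual fix keeps the precedence levels but parses each level with a loop, folding to the left instead of recursing to the right. A minimal sketch of just the '+'/'-' level (my code, not the submitted parser's; assumes a pre-tokenized list):

  def parse_exp(tokens):
      value = float(tokens.pop(0))  # stand-in for parsing an Exp2
      # Exp -> Exp2 (('+' | '-') Exp2)* parsed iteratively: each new
      # operand folds into the running result from the left.
      while tokens and tokens[0] in ('+', '-'):
          op = tokens.pop(0)
          rhs = float(tokens.pop(0))
          value = value + rhs if op == '+' else value - rhs
      return value

  >>> parse_exp(list('3-2-1'))
  0.0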

Edit: Why the downvote? Am I wrong?


You are correct, the result 2 from the submitted parser.py is erroneous. Downvoters should be ashamed of their abusive behaviour.

It is not well known amongst programmers who would like to use parsers that most algorithms, as implemented in countless software libraries, are wrong and broken beyond repair. We should strive to make this a well-known fact and promote the parsers that work properly, so that fewer programmers and end users need to suffer.


This does not properly handle associativity.

It erroneously computes "7-4+2" as 1, not 5.


Doesn't shunting yard solve this pretty simply? Once you have an expression in RPN, building expression trees is pretty easy.
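
For example, a stack-based evaluation sketch over the RPN tokens (my code; it evaluates directly rather than building trees, and assumes well-formed input) gets the associativity right for free:

  from operator import add, sub, mul, truediv

  def eval_rpn(tokens):
      ops = {'+': add, '-': sub, '*': mul, '/': truediv}
      stack = []
      for tok in tokens:
          if tok in ops:
              b, a = stack.pop(), stack.pop()  # right operand is on top
              stack.append(ops[tok](a, b))
          else:
              stack.append(float(tok))
      return stack.pop()

  >>> eval_rpn(['7', '4', '-', '2', '+'])
  5.0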


I recently did this same project in JavaScript and you're correct: shunting yard is perfect for this.


I did a project with Shunting-Yard in Golang and yes, that's the proper way to do this.


Similar calculator written in C++ by Bjarne Stroustrup: https://www.stroustrup.com/dc.c and in Go by me: https://github.com/glxxyz/snippets/blob/main/calculator/dc.g...

Not that much longer, and it supports an interactive session with named variables.


When I saw "handwritten math" I thought it was something in the vein of https://photomath.app


Shameless plug: probably messy code, but same mission.

https://github.com/kuesji/mec


    (3) Exp -> Exp2
    (4) Exp2 -> Exp2 * Exp3
    (5) Exp2 -> Exp2 / Exp3
    (6) Exp2 -> Exp3
This is horrible to read and to maintain. Real grammars use names like "additive_expr", "multiplicative_expr" etc., or maybe "term" and "factor", though I can never remember which is which.
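
Concretely, the quoted rules with descriptive names (my renaming, not from the submitted grammar):

    (3) additive_expr       -> multiplicative_expr
    (4) multiplicative_expr -> multiplicative_expr * unary_expr
    (5) multiplicative_expr -> multiplicative_expr / unary_expr
    (6) multiplicative_expr -> unary_expr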


For me it's actually the opposite: I find using the priority makes it a lot more readable, especially once you have a lot of operators. It's also quite close to a generalized (and potentially dynamic) version that is just a function that takes in a priority and returns the correct parser.
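
Something like this sketch (my code, assuming a pre-tokenized list of single-character tokens):

  from operator import add, sub, mul, truediv

  # One parse function parameterized by priority level stands in for
  # the named rules Exp/Exp2/Exp3; level 0 binds loosest.
  LEVELS = [{'+': add, '-': sub}, {'*': mul, '/': truediv}]

  def parse(tokens, level=0):
      if level == len(LEVELS):
          return float(tokens.pop(0))  # bottom level: a bare number
      value = parse(tokens, level + 1)
      while tokens and tokens[0] in LEVELS[level]:
          op = LEVELS[level][tokens.pop(0)]
          value = op(value, parse(tokens, level + 1))  # fold left
      return value

  >>> parse(list('1+2*3'))
  7.0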


It's not that I'm opposed to priorities, it's just that I think it's unreasonably hard to see the typos here:

    (3) Exp -> Exp2
    (4) Exp2 -> Exp2 * Exp3
    (5) Exp3 -> Exp2 / Exp2
    (6) Exp2 -> Exp3
There are reasons why we usually give variables more meaningful names than var1, var2, var3.

(Obviously I agree that if the parser can be expanded dynamically with operators with numeric priorities, then you need numeric priorities. But that is a very different case.)


> though I can never remember which is which

If you can't remember which is which, this is a good indication that the names don't matter, and in fact they don't, because it's an artifact of the way you need to write the grammar.

I'm much more in favor of Exp/Exp2/Exp3.


It's not term and factor, it's just operator precedence. Hence 1,2,3 is appropriate.


Yes, it is horrible, but that doesn't make the grammar less real; labels don't matter...


Labels definitely matter. Naming is one of the two hardest things [0].

The point in this case is that the names are hard because they're making up for limitations in left-recursive grammars. Personally, I still find "Addable_expr" more intuitive than "additive_expr": it reads as "an expression that can be added". Both are more intuitive (to me) than "Expr2". But that's personal preference, and part of why naming is hard :).

[0] https://martinfowler.com/bliki/TwoHardThings.html


I am not disagreeing with you, I just think that the "grammar" still works even with terrible naming. It is harder to read, it is harder to maintain, but it is still real. Perhaps I am missing some subtler point here...


When I wrote "real grammars" above I meant the grammars that people write to specify "real" widely used production-quality programming languages.

I linked to the C standard elsethread: https://web.archive.org/web/20161223125339/http://flash-gord...

Here is the Java specification: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.htm...

Here is the Python language reference (with terser names, but definitely "m" for multiplicative and "a" for additive expressions): https://docs.python.org/3/reference/expressions.html#binary-...

These are "real" grammars. As opposed to a "toy" grammar that someone throws on GitHub to demonstrate an "X in 100 lines" thing.


Thanks, I am enjoying reading this!


Try your idea. It won't work. The reason is that they aren't "addable", they are precedence1 and precedence2.

You can change Exp to Precedence if you want, but that's just longer.


"It won't work" is a hilarious claim seeing how actual language standards do exactly this, and it works well. See the C standard for example: https://web.archive.org/web/20161223125339/http://flash-gord... . Search for "additive-expression" to see the relevant grammar rules.


Really nicely-written project. The documentation is clear and sufficiently comprehensive; the project structure is immediately obvious and navigable; and each source file is both intuitively named and very readable.


Where would I read up on what's currently considered best practice for implementing a parser? We learned recursive descent in school, but I have heard several remarks that this is no longer considered en vogue.


I don't know about best practice, but having implemented a few different parsing algorithms for the same language, I found "Pratt parsers" to have unique advantages, keeping the logic simple, clear, and flexible.

At least for the particular needs of that language's syntax, and my preference for hand-rolled parsers (rather than generated from grammar), it was the best solution. It allowed me to elegantly solve some syntactic groupings that were difficult/complex to parse with other approaches.

Here's an excellent overview of the algorithm, with examples in Python:

Pratt Parsers: Expression Parsing Made Easy - http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...

A newer article with implementation in Rust:

Simple but Powerful Pratt Parsing - https://matklad.github.io/2020/04/13/simple-but-powerful-pra...
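
To give the flavour, a minimal sketch of the core Pratt loop (binding powers and token handling are my own simplifications, not taken from either article):

  from operator import add, sub, mul, truediv

  # Binding power and handler for each binary operator.
  BP = {'+': (10, add), '-': (10, sub), '*': (20, mul), '/': (20, truediv)}

  def parse_expr(tokens, min_bp=0):
      left = float(tokens.pop(0))  # "nud": a literal number
      # "led" loop: keep extending left while the next operator binds
      # more tightly than the caller's minimum binding power; the
      # strict '>' makes the binary operators left-associative.
      while tokens and tokens[0] in BP and BP[tokens[0]][0] > min_bp:
          bp, fn = BP[tokens.pop(0)]
          left = fn(left, parse_expr(tokens, bp))
      return left

  >>> parse_expr(list('7-4+2'))
  5.0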


Recursive descent is what I use today. En vogue or not, it's the most robust, flexible and - most importantly - maintainable approach in the long term.


It's also easy to explain to someone who has not studied parsing, and easy for them to implement. That was why I chose recursive descent when I needed to trick someone into writing a C compiler.

This was back in the early '80s, back when there was much more variety in personal computers both in CPU architecture and operating system, and C was fairly rare on such small machines.

A friend who was a really good programmer but who had not taken the compiler course in college (he was a math major, not a CS major) had started a company aiming to produce a C compiler for a new small computer that was coming out in a few months, with the compiler meant to be available as soon as the computer came out. He'd hired another friend to write the compiler.

This was well before the internet, when advertising was a lot slower than it is now. The company had to buy its ads in the relevant computer magazines well ahead of time.

That guy writing the compiler was a genius, and was on track to produce an amazing compiler especially given the limited resources of the personal computers back then...but it turned out he was not a fast genius. The compiler would not be finished until long after the new computer launched.

The first friend was getting rather depressed, knowing he had to get someone faster to write the first version of the compiler but not knowing anyone else who could do it and was available.

So one evening I decided to convince him that he could do it himself. I sat him down in front of a whiteboard, and told him he needed three things: a lexical analyzer to tokenize the input, a parser to parse the token stream, and a code generator to turn the parse tree into assembly.

I then outlined recursive descent parsing on the whiteboard, giving examples of using it on C. I was able to convince him that parsing was not hard, and that he could do it. I then moved on to code generation, and showed him that he could generate code for a simple stack-based virtual machine, which could then easily be mapped to the real, register-based CPU--it would be inefficient, but I also went over how he could later add a simple optimizer to remove a lot of the inefficiency.

Once I had him convinced that he could handle parsing and code generation and some optimization by their deadline, I pretended we were done and started to leave. He interrupted and asked about lexical analysis.

I acted embarrassed that I had forgotten that, and "admitted" that it was the hardest part, and said I'd try to think of something and went home.

Of course that was a lie. When I got home, I quickly wrote a lexical analyzer for C, and then the next morning went back and handed it to him.

He now actually believed that the hardest part was done, and was now sure that he could do the remaining easier parts himself. And he did. We both got a good laugh later when I finally told him that lexical analysis is the easiest part.


I just tested a recursive descent parser and it fails a lot. Since I could only find one working off-the-shelf implementation on short notice, I do not know whether the failures are the fault of the algorithm or of the implementation. Based on prior experience, though, I suspect it's the fault of the algorithm, and if that's the case, we should not use it to implement parsers. I would gladly test more recursive descent software if you could show me some and share a link.

Test results: https://paste.debian.net/plainh/cd5124a3


The C# compiler uses recursive descent.



