This is a really fun exercise, I recommend every programmer try it once. You can replace most of the tokenization code with re.Scanner[1], which also allows you to have strings without worrying about `code.split()` messing them up.
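For instance, a sketch along these lines handles quoted strings in a single pass (the token names and patterns here are my own guess, not the article's):

    import re

    # re.Scanner takes (pattern, action) pairs; each action returns a token,
    # and None means "skip the match" (used for whitespace here).
    scanner = re.Scanner([
        (r'"[^"]*"',      lambda s, tok: ("STRING", tok[1:-1])),
        (r'\d+',          lambda s, tok: ("NUMBER", int(tok))),
        (r'[A-Za-z_]\w*', lambda s, tok: ("NAME", tok)),
        (r'[+\-*/=()]',   lambda s, tok: ("OP", tok)),
        (r'\s+',          None),
    ])

    tokens, remainder = scanner.scan('print "hello world" + 1')
    # tokens == [('NAME', 'print'), ('STRING', 'hello world'), ('OP', '+'), ('NUMBER', 1)]
    # remainder == ''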
This seems great. When I read these kinds of explanations, it always strikes me how the way you're supposed to write a programming language is basically the same way someone who had never written one might do it: break the input up into units, map the units to some category, and connect functionality to that. If you forced someone with two weeks of programming experience to come up with a way to turn code into programs, they would probably arrive at something like the lexer/parser/evaluator process if you gave them enough time.
You might enjoy https://hokstad.com/compiler . I could never get on with the traditional top-down presentation of building a compiler where you start with this idea that you're going to build a parser and a lexer and all that. Seeing someone build a compiler the same way you'd build a regular program - start from the simple cases and expand out to cover what you need to - was a real revelation. In practice it's ended up with something pretty similar to the standard architecture, but now I can understand why.
- the `tokens` method returning a variable-length tuple is a very bad idea; modern Python has the `dataclass` decorator, which should be used in this case (see the sketch after this list)
- I'd strongly recommend using a regex for lexing; much easier to get it right than a long if/elif/else block
- I'd recommend refactoring the code and implementing the `next_token` logic so that it works over any generator; an example implementation could look like the peekable wrapper in the sketch after this list (WARNING: I wrote this in 5 min, it would require extensive unit tests!)
- statements with a "prefix" (like `print X` or `if Y`) are very easy to parse (just use the lookahead token!), but when you get to parsing expressions, I strongly recommend using a Pratt parser (extensible operator precedence parsing); a rough sketch follows after this list
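For the first and third suggestions, here's a quick sketch of what that could look like (the names are illustrative, not the article's code):

    from dataclasses import dataclass
    from typing import Any, Iterator, Optional

    # A Token dataclass instead of variable-length tuples, plus a generic
    # one-token-lookahead wrapper that works over any generator/iterator.
    @dataclass
    class Token:
        kind: str
        value: Any
        line: int = 0

    class Peekable:
        def __init__(self, it: Iterator[Token]):
            self._it = iter(it)
            self._peeked: Optional[Token] = None

        def peek(self) -> Optional[Token]:
            # Look at the next token without consuming it.
            if self._peeked is None:
                self._peeked = next(self._it, None)
            return self._peeked

        def next_token(self) -> Optional[Token]:
            # Consume and return the next token (None when exhausted).
            tok = self.peek()
            self._peeked = None
            return tok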
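And for expression parsing, a rough precedence-climbing sketch (the same idea as a Pratt parser); the token format and operator table are made up purely for illustration:

    # Precedence-climbing expression parser over a flat token list.
    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_atom(tokens, pos):
        # Only numbers and parenthesized sub-expressions, for brevity.
        if tokens[pos] == "(":
            node, pos = parse_expression(tokens, pos + 1, 1)
            return node, pos + 1  # skip the closing ")"
        return ("num", int(tokens[pos])), pos + 1

    def parse_expression(tokens, pos=0, min_prec=1):
        left, pos = parse_atom(tokens, pos)
        # Keep folding in operators whose precedence is at least min_prec.
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left, pos

    tree, _ = parse_expression("1 + 2 * ( 3 - 4 )".split())
    # tree == ('+', ('num', 1), ('*', ('num', 2), ('-', ('num', 3), ('num', 4))))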
Great article. If someone is looking for a more advanced example, some time ago, as an exercise, I created an interpreter for a Python-like language in Python https://github.com/akrylysov/abrvalg
I love Python but after taking PL in Racket (a Lisp-like language) it's hard to imagine implementing toy languages otherwise. Seems like Python 3.10's match statements might come in handy.
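For example, a tree-walking evaluator over tuple-shaped nodes gets quite compact with match; the node shapes here are made up purely for illustration:

    # Hypothetical AST node shapes, just to show structural pattern matching.
    def evaluate(node, env):
        match node:
            case ("num", value):
                return value
            case ("var", name):
                return env[name]
            case ("add", left, right):
                return evaluate(left, env) + evaluate(right, env)
            case ("if", cond, then_branch, else_branch):
                return evaluate(then_branch if evaluate(cond, env) else else_branch, env)
            case _:
                raise ValueError(f"unknown node: {node!r}")

    print(evaluate(("add", ("num", 2), ("var", "x")), {"x": 40}))  # 42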
Great article. I’ve always wanted to try and build a toy language for the fun of it, great to see something that breaks it down so well.
Curious if anyone has any additional thoughts to add onto this for anyone who wants to build one themselves?
I have experience in C++, so I'll likely use that or Rust (I imagine Rust's built-in methods might make it way easier, at least for the parser, although idk if ownership rules get painful for work like this).
Yes, my thought is to read Bob Nystrom's "Crafting Interpreters". It's excellent, in-depth but not academic, and the writing style is great: https://craftinginterpreters.com/ He also has a lot of good thoughts about language design.
Really great content, and an interesting exercise!
A while back, I wanted to do something similar but built on top of a low-level language to get good performance for a specific use case. I'd love to see an article about something built on top of Rust or C (like a retrospective analysis of the successes/mistakes of Python over the years).
Once you have an intuition, you just write the parser yourself.
I'm not a fan of parser generators. I find them restrictive. It is also notable that most mainstream languages do not use tools like YACC; they have "handmade" parsers and compilers.
This may seem like I'm suggesting a pie-in-the-sky, ground-up approach for something that may never need such control. Could be. But I don't feel like learning YACC is easier than learning how to parse anything in general. I think YACC and so on are holding people back from truly expressing themselves with custom languages, and wasn't that the whole point of writing their own language in the first place?
So let's say you decide to create your own programming language. Can you now automatically afford to do as much work as the organizations behind mainstream programming languages, organizations with thousands of contributors, funding, etc.?
Now, let's set that aside for a moment. Let's say that after years of work you have implemented your language and, for some reason, you get one prospective user.
That person will ask you about: syntax highlighting, linting, code navigation, testing, packaging, documentation, operating systems/architecture support, etc. Who is going to contribute all that if you are busy writing everything by hand?
Unless you have as much time as Terry Davis, you are better off starting by piggybacking on something that exists. Then, once you have an idea that scales, you can convince other people to help you and have a successful project. Then you can have a viable ecosystem that people can comfortably join.
The real problem with parsing is not handling the formal grammar. It's deciding what to keep from the parsed input (CST to AST), designing the ancillary data structures for further processing, integrating the parser with the rest of the application, etc.
The real problem with building a compiler is everything else, not parsing. Parsing is the easiest part. Once you've written a few recursive descent parsers you will probably agree.
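To make that concrete, a minimal recursive-descent sketch (the mini-grammar and token shapes are made up, just to show the shape of the technique): each construct gets a function that looks at the next token and decides what to parse.

    # Statements: 'print <value>' or '<name> = <value>'.
    def parse_program(tokens):
        pos, stmts = 0, []
        while pos < len(tokens):
            stmt, pos = parse_statement(tokens, pos)
            stmts.append(stmt)
        return stmts

    def parse_statement(tokens, pos):
        if tokens[pos] == "print":
            return ("print", tokens[pos + 1]), pos + 2
        name, _, value = tokens[pos:pos + 3]   # assignment
        return ("assign", name, value), pos + 3

    print(parse_program(["x", "=", "1", "print", "x"]))
    # [('assign', 'x', '1'), ('print', 'x')]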
The parser for syntax highlighting is probably different from the parser for AST generation and compilation. Likewise, a parser for batch compilation differs from one for incremental compilation. Even the syntax might be different.
Using YACC doesn't give you magical integration with IDEs, operating systems, testing, documentation and so on. I feel as if you took my specific comment about parsers (which are not hard to write at all; a few hundred lines of code for a moderately complex language) in an inappropriately general way, as in "don't use anything for anything when doing anything".
That's agreeable. We seem to disagree on how to achieve that ideal - on the specific point of whether to use a parser generator or not. The correct answer is not :)
I'm not sure why this intuition is so strong. I had it myself before learning this stuff in my early 20s on shipping projects (both existing and from scratch), and at the end of that I was still in my early 20s.
Despite this intuition it's all actually very learnable and my experience is that using a parser generator makes no sense in a professional context and very little sense in a hobby/toy project context.
Why do you think people are writing toy programming languages?
It's a learning exercise. Sure, if you want to learn to use yacc, or whatever, use yacc. I don't think time efficiency is a useful metric when the overwhelming majority of people in this space aren't doing it for their job, or even because they 'need' to create a new language.
Ultimately you could just use Python; it'll be faster and more fully featured.
You will likely not get a perfect syntax right away. So you will need to iterate on it until you are happy with how it looks and feels, and doing it while developing the compiler yourself will naturally lead to a lot of unnecessary bugs and tech debt.
Alternatively, you iterate using a cheaper method, and once you reach a point at which you are content with the syntax, you can consider writing a compiler for it.
You can also avoid iterating on the syntax so you can ship a compiler as fast as possible, but users won't like the syntax and won't use your language.
I don't really think of yacc as having much more abstraction than a regex engine -- like in an (NFA-based) regex engine, you're building a big data structure a human wouldn't want to hand-engineer from a declarative specification. (And, like in regexes, you've got good theory backing it.)
By "good theory," I mean being able to know properties of the parser as a result of the formalism.
For example, one can prove that any yacc-generated parser never infinite loops and runs in O(n) (at least for the actual parsing; you can write custom code in actions that runs when a construct has been parsed, and of course you can write whatever you want there).
If you hand-write a parser for a grammar like the following, you'll end up with an infinite loop unless you restructure the grammar; such restructuring is of course much easier if you have written the grammar out, rather than having it exist only implicitly in the code.
Expr ::= Expr '[' Expr ']'
Expr ::= Name
Expr ::= Int
If you try to naively parse the following, you'll never parse the marked production either. Something like yacc will warn you about this, while an ordinary compiler compiling a hand-written parser won't.
> For example, one can prove that any yacc-generated parser never infinite loops and runs in O(n)
Trivial when actually writing the code.
And your example of infinite left recursion is basic parser theory. No one in their right mind would write a hanging program like that, at least not for long.
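For what it's worth, the usual hand-written fix for that left-recursive rule is to parse the base case first and then consume the '[ Expr ]' suffixes in a loop; a rough sketch (the token-stream helper is hypothetical):

    # Handles Expr ::= Expr '[' Expr ']' | Name | Int without left recursion.
    class TokenStream:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None
        def take(self):
            tok = self.peek()
            self.pos += 1
            return tok

    def parse_expr(ts):
        node = ts.take()              # Name or Int (base cases)
        while ts.peek() == "[":       # any number of postfix '[ Expr ]'
            ts.take()                 # consume '['
            index = parse_expr(ts)
            ts.take()                 # consume ']'
            node = ("index", node, index)
        return node

    print(parse_expr(TokenStream(["a", "[", "i", "[", "0", "]", "]"])))
    # ('index', 'a', ('index', 'i', '0'))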
Left recursion is basic, sure, but it's not a problem for an (LA)LR parser generator like yacc. (Depending on your parse actions, you might even get slightly better memory consumption when using left recursion rather than right recursion.)
[1] https://news.ycombinator.com/item?id=36517749