This is a really fun exercise, I recommend every programmer try it once. You can replace most of the tokenization code with re.Scanner[1], which also allows you to have strings without worrying about `code.split()` messing them up.
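For instance, a sketch along these lines handles quoted strings in a single pass (the token names and patterns here are my own guess, not the article's):

    import re

    # re.Scanner takes (pattern, action) pairs; each action returns a token,
    # and None means "skip the match" (used for whitespace here).
    scanner = re.Scanner([
        (r'"[^"]*"',      lambda s, tok: ("STRING", tok[1:-1])),
        (r'\d+',          lambda s, tok: ("NUMBER", int(tok))),
        (r'[A-Za-z_]\w*', lambda s, tok: ("NAME", tok)),
        (r'[+\-*/=()]',   lambda s, tok: ("OP", tok)),
        (r'\s+',          None),
    ])

    tokens, remainder = scanner.scan('print "hello world" + 1')
    # tokens == [('NAME', 'print'), ('STRING', 'hello world'), ('OP', '+'), ('NUMBER', 1)]
    # remainder == ''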
This seems great. When I read these kinds of explanations, it always strikes me how the way you're supposed to write a programming language is basically the same way someone who had never written one might do it: break the input up into units, map the units to some category, and connect functionality to that. If you forced someone with two weeks of programming experience to come up with a way to turn code into programs, they would probably arrive at something like the lexer/parser/evaluator process if you gave them enough time.
You might enjoy https://hokstad.com/compiler . I could never get on with the traditional top-down presentation of building a compiler where you start with this idea that you're going to build a parser and a lexer and all that. Seeing someone build a compiler the same way you'd build a regular program - start from the simple cases and expand out to cover what you need to - was a real revelation. In practice it's ended up with something pretty similar to the standard architecture, but now I can understand why.
- the `tokens` method returning a variable-length tuple is a very bad idea; modern Python has the `dataclass` decorator, which should be used in this case (see the sketch after this list)
- I'd strongly recommend using a regex for lexing; much easier to get it right than a long if/elif/else block
- I'd recommend refactoring the code and implementing the `next_token` logic so that it works over any generator; an example implementation could look like the peekable wrapper in the sketch after this list (WARNING: I wrote this in 5 min, it would require extensive unit tests!)
- statements with a "prefix" (like `print X` or `if Y`) are very easy to parse (just use the lookahead token!), but when you get to parsing expressions, I strongly recommend using a Pratt parser (extensible operator precedence parsing); a rough sketch follows after this list
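For the first and third suggestions, here's a quick sketch of what that could look like (the names are illustrative, not the article's code):

    from dataclasses import dataclass
    from typing import Any, Iterator, Optional

    # A Token dataclass instead of variable-length tuples, plus a generic
    # one-token-lookahead wrapper that works over any generator/iterator.
    @dataclass
    class Token:
        kind: str
        value: Any
        line: int = 0

    class Peekable:
        def __init__(self, it: Iterator[Token]):
            self._it = iter(it)
            self._peeked: Optional[Token] = None

        def peek(self) -> Optional[Token]:
            # Look at the next token without consuming it.
            if self._peeked is None:
                self._peeked = next(self._it, None)
            return self._peeked

        def next_token(self) -> Optional[Token]:
            # Consume and return the next token (None when exhausted).
            tok = self.peek()
            self._peeked = None
            return tok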
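And for expression parsing, a rough precedence-climbing sketch (the same idea as a Pratt parser); the token format and operator table are made up purely for illustration:

    # Precedence-climbing expression parser over a flat token list.
    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_atom(tokens, pos):
        # Only numbers and parenthesized sub-expressions, for brevity.
        if tokens[pos] == "(":
            node, pos = parse_expression(tokens, pos + 1, 1)
            return node, pos + 1  # skip the closing ")"
        return ("num", int(tokens[pos])), pos + 1

    def parse_expression(tokens, pos=0, min_prec=1):
        left, pos = parse_atom(tokens, pos)
        # Keep folding in operators whose precedence is at least min_prec.
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left, pos

    tree, _ = parse_expression("1 + 2 * ( 3 - 4 )".split())
    # tree == ('+', ('num', 1), ('*', ('num', 2), ('-', ('num', 3), ('num', 4))))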
Great article. If someone is looking for a more advanced example, some time ago, as an exercise, I created an interpreter for a Python-like language in Python https://github.com/akrylysov/abrvalg
I love Python but after taking PL in Racket (a Lisp-like language) it's hard to imagine implementing toy languages otherwise. Seems like Python 3.10's match statements might come in handy.
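For example, a tree-walking evaluator over tuple-shaped nodes gets quite compact with match; the node shapes here are made up purely for illustration:

    # Hypothetical AST node shapes, just to show structural pattern matching.
    def evaluate(node, env):
        match node:
            case ("num", value):
                return value
            case ("var", name):
                return env[name]
            case ("add", left, right):
                return evaluate(left, env) + evaluate(right, env)
            case ("if", cond, then_branch, else_branch):
                return evaluate(then_branch if evaluate(cond, env) else else_branch, env)
            case _:
                raise ValueError(f"unknown node: {node!r}")

    print(evaluate(("add", ("num", 2), ("var", "x")), {"x": 40}))  # 42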
Great article. I’ve always wanted to try and build a toy language for the fun of it, great to see something that breaks it down so well.
Curious if anyone has any additional thoughts to add onto this for anyone who wants to build one themselves?
I have experience in C++, so I'll likely use that or Rust (I imagine Rust's built-in methods might make it way easier, at least for the parser, although idk if ownership rules get painful for work like this).
Yes, my thought is to read Bob Nystrom's "Crafting Interpreters". It's excellent, in-depth but not academic, and the writing style is great: https://craftinginterpreters.com/ He also has a lot of good thoughts about language design.
Really great content, and an interesting exercise!
A while back, I wanted to do something similar but built on top of a low-level language to get good performance for a specific use case. I'd love to see an article about something built on top of Rust or C (like a retrospective analysis of the successes/mistakes of Python over the years).
Once you have an intuition, you just write the parser yourself.
I'm not a fan of parser generators. I find them restrictive. It is also notable that most mainstream languages do not use tools like YACC; they have "handmade" parsers and compilers.
This may seem like I'm suggesting a pie-in-the-sky, ground-up approach for something that may never need such control. Could be. But I don't feel like learning YACC is easier than learning how to parse anything in general. I think YACC and so on are holding people back from truly expressing themselves with custom languages, and wasn't that the whole point of writing their own language in the first place?
So let's say you decide to create your own programming language. Can you now automatically afford to do as much work as the organizations behind mainstream programming languages, organizations with thousands of contributors, funding, etc.?
Now, let's set that aside for a moment. Let's say that after years of work you have implemented your language and, for some reason, you get one prospective user.
That person will ask you about: syntax highlighting, linting, code navigation, testing, packaging, documentation, operating systems/architecture support, etc. Who is going to contribute all that if you are busy writing everything by hand?
Unless you have as much time as Terry Davis, you are better off starting by piggybacking on something that exists. Then, once you have an idea that scales, you can convince other people to help you and have a successful project. Then you can have a viable ecosystem that people can comfortably join.
The real problem with parsing is not handling the formal grammar. It's deciding what to keep from the parsed input (CST to AST), designing the ancillary data structures for further processing, integrating the parser with the rest of the application, etc.
The real problem with building a compiler is everything else, not parsing. Parsing is the easiest part. Once you've written a few recursive descent parsers you will probably agree.
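To make that concrete, a minimal recursive-descent sketch (the mini-grammar and token shapes are made up, just to show the shape of the technique): each construct gets a function that looks at the next token and decides what to parse.

    # Statements: 'print <value>' or '<name> = <value>'.
    def parse_program(tokens):
        pos, stmts = 0, []
        while pos < len(tokens):
            stmt, pos = parse_statement(tokens, pos)
            stmts.append(stmt)
        return stmts

    def parse_statement(tokens, pos):
        if tokens[pos] == "print":
            return ("print", tokens[pos + 1]), pos + 2
        name, _, value = tokens[pos:pos + 3]   # assignment
        return ("assign", name, value), pos + 3

    print(parse_program(["x", "=", "1", "print", "x"]))
    # [('assign', 'x', '1'), ('print', 'x')]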
The parser for syntax highlighting is probably different from the parser for AST generation and compilation. Likewise, a parser for batch compilation differs from one for incremental compilation. Even the syntax might be different.
Using YACC doesn't give you magical integration with IDEs, operating systems, testing, documentation and so on. I feel as if you took my specific comment about parsers (which are not hard to write at all; a few hundred lines of code for a moderately complex language) in an inappropriately general way, as in "don't use anything for anything when doing anything".
That's agreeable. We seem to disagree on how to achieve that ideal - on the specific point of whether to use a parser generator or not. The correct answer is not :)
I'm not sure why this intuition is so strong. I had it myself before learning this stuff in my early 20s on shipping projects (both existing and from scratch), and at the end of that I was still in my early 20s.
Despite this intuition it's all actually very learnable and my experience is that using a parser generator makes no sense in a professional context and very little sense in a hobby/toy project context.
Why do you think people are writing toy programming languages?
It's a learning exercise. Sure, if you want to learn to use yacc, or whatever, use yacc. I don't think time efficiency is a useful metric when the overwhelming majority of people in this space aren't doing it for their job, or even because they 'need' to create a new language.
Ultimately you could just use Python; it'll be faster and more fully featured.
You will likely not get a perfect syntax right away. So you will need to iterate on it until you are happy with how it looks and feels, and doing it while developing the compiler yourself will naturally lead to a lot of unnecessary bugs and tech debt.
Alternatively, you iterate using a cheaper method, and once you reach a point at which you are content with the syntax, you can consider writing a compiler for it.
You can also avoid iterating on the syntax so you can ship a compiler as fast as possible, but users won't like the syntax and won't use your language.
I don't really think of yacc as having much more abstraction than a regex engine -- like in an (NFA-based) regex engine, you're building a big data structure a human wouldn't want to hand-engineer from a declarative specification. (And, like in regexes, you've got good theory backing it.)
By "good theory," I mean being able to know properties of the parser as a result of the formalism.
For example, one can prove that any yacc-generated parser never infinite loops and runs in O(n) (at least for the actual parsing; you can write custom code in actions that runs when a construct has been parsed, and of course you can write whatever you want there).
If you hand-write a parser for a grammar like the following, you'll end up with an infinite loop unless you restructure the grammar; such restructuring is of course much easier if you have written the grammar out, rather than having it exist only implicitly in the code.
Expr ::= Expr '[' Expr ']'
Expr ::= Name
Expr ::= Int
If you try to naively parse the following, you'll never parse the marked production either. Something like yacc will warn you about this, while an ordinary compiler compiling a hand-written parser won't.
> For example, one can prove that any yacc-generated parser never infinite loops and runs in O(n)
Trivial when actually writing the code.
And your example of infinite left recursion is basic parser theory. No one in their right mind would write a hanging program like that, at least not for long.
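For what it's worth, the usual hand-written fix for that left-recursive rule is to parse the base case first and then consume the '[ Expr ]' suffixes in a loop; a rough sketch (the token-stream helper is hypothetical):

    # Handles Expr ::= Expr '[' Expr ']' | Name | Int without left recursion.
    class TokenStream:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None
        def take(self):
            tok = self.peek()
            self.pos += 1
            return tok

    def parse_expr(ts):
        node = ts.take()              # Name or Int (base cases)
        while ts.peek() == "[":       # any number of postfix '[ Expr ]'
            ts.take()                 # consume '['
            index = parse_expr(ts)
            ts.take()                 # consume ']'
            node = ("index", node, index)
        return node

    print(parse_expr(TokenStream(["a", "[", "i", "[", "0", "]", "]"])))
    # ('index', 'a', ('index', 'i', '0'))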
Left recursion is basic, sure, but it's not a problem for an (LA)LR parser generator like yacc. (Depending on your parse actions, you might even get slightly better memory consumption when using left recursion rather than right recursion.)
[1] https://news.ycombinator.com/item?id=36517749