Resources for Amateur Compiler Writers (c9x.me)
339 points by kilodeca on April 24, 2021 | 88 comments



This is definitely a backend, machine code focused list. If you're writing a compiler to WebAssembly or even the JVM you're probably not gonna need a lot of these resources.

Indeed I'd argue that these lists are more for an intermediate compiler writer. Beginner compiler writers don't need to know register allocation. They need to understand how to think about language translation, how to write an intermediate representation, how to learn a target language.

I've been toying with the idea of writing a post that goes through the thought process of translation. Specifically, how you build up states and invariants. When I write code generation I spend time thinking about the different states the stack could be in, or what invariant I need to keep across this expression (do I need to save a register or a global?).

One simple but important example of this is realizing that given a proper, sound type checker, your code generator should almost never give errors. By the time you reach code generation you should have proven enough invariants that code generation cannot fail. This isn't always true but I've found it to be the case.
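
To make that concrete, here's a minimal sketch (node types and names are made up): the type checker has already rejected anything ill-typed, so the fallthrough branch in emit() is genuinely unreachable.

    # Sketch of codegen that relies on the type checker's guarantees.
    # All node types and the stack-machine output are invented for the example.
    from dataclasses import dataclass

    @dataclass
    class IntLit:
        value: int

    @dataclass
    class Add:
        left: object
        right: object

    def emit(expr, out):
        if isinstance(expr, IntLit):
            out.append(f"push {expr.value}")
        elif isinstance(expr, Add):
            emit(expr.left, out)
            emit(expr.right, out)
            out.append("add")
        else:
            # Unreachable if the type checker is sound: every node kind
            # that survives checking is handled above.
            raise AssertionError(f"type checker let through: {expr!r}")

    code = []
    emit(Add(IntLit(1), IntLit(2)), code)
    print("\n".join(code))  # push 1 / push 2 / add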


> Indeed I'd argue that these lists are more for an intermediate compiler writer. Beginner compiler writers don't need to know register allocation.

I'm not disagreeing with you, but note that the title does not say it's for beginners, it says it's for amateurs, which is just anyone who does it for fun rather than money.


That reminds me of the Dragon Book, which goes heavily into VAX assembly and optimizing for CISC machines. Probably not the best resource for beginners or amateurs these days, but its authors won a Turing Award, at least.

The problem is that what it means to write a compiler is pretty broad: some amateur projects focus on backend work, some on the front end, many in between.


Also tremendous amounts just on parsing C. I found Lisp In Small Pieces far more interesting.


The Dragon Book is just not very good. The newer editions spend far too long on parsing, which is not a big problem, and just make it look scary.

Grune's Parsing Techniques is a good book (IIRC when I read it ten years ago), and there are some other ones here.

https://gcc.gnu.org/wiki/ListOfCompilerBooks


> Beginner compiler writers don't need to know register allocation. They need to understand how to think about language translation, how to write an intermediate representation, how to learn a target language.

I don't think people that want to write a compiler want to end up with helloworld.c instead of helloworld.exe. At least I don't.


At least for me, compiling to a simpler target made learning the ideas easier. Compiling to raw assembly does give you more nerd credit, but it also makes it harder to get started.


The opposite, really. I started learning about compilers because I wanted to implement my own syntactic ideas. Don't care at all how it runs.


That's what interpreters are great for.


Ideally, you'd emit LLVM IR or some similar bytecode or assembly. Then you'd use llc/opt or an appropriate assembler to get an executable.

Trying to emit machine code isn't too interesting when you're writing your first compiler.
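
To illustrate how little is needed for a first backend, here's a hypothetical sketch that just writes textual LLVM IR and leaves the rest to the stock tools (the function and file names are invented):

    # Sketch: a "backend" that emits textual LLVM IR for an add function.
    # The heavy lifting (instruction selection, register allocation) is
    # then done by llc or clang, not by your compiler.
    ir = """\
    define i32 @add(i32 %a, i32 %b) {
    entry:
      %sum = add i32 %a, %b
      ret i32 %sum
    }
    """
    with open("add.ll", "w") as f:
        f.write(ir)
    # Then, e.g.:  llc add.ll -o add.s   or   clang add.ll main.c -o main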


Similarly, trying to parse C++ isn't too interesting when you're writing your first backend.


The creator of C++ would disagree, as C++ was devised as a compile-to-C language.


You can certainly take any path into compiler writing. I don't know anything about ASTs or compiler front ends, but I ended up writing register allocators and constant folding for an Nvidia GPU JIT.


I would be interested in reading your yet-to-be-written post!


That would be great to read, I sincerely hope you write it, post it here, and I get to read it.


A lot of ink is spilled on the details of how compilers turn text into fast executables, but I believe the user experience of compilers is just as important a subject, and there's very little information about it in the literature outside of a handful of papers and the design documents of existing compilers that do focus on this.


No idea why you are being downvoted, but it's true.

Compiler UX is often horrible, but the new breed of languages tries to rectify parts of it:

- non-cryptic error messages. Show exactly what failed, where it failed, and how to solve it.

- compiler-as-a-service. Tool makers will love you.

- compilation speed. If I need to wait 2 minutes for a 10 LOC project to compile, your compiler is doing something wrong.

- the compiler should be the only tool you need. If you rely on 15 different tools just to compile a project, you're doing something wrong.


As you allude to, treating a compiler as an isolated batch process is an outdated strategy. Any compiler started today should be designed such that it can be used in an IDE from the start, even if you don't implement it in that way until after you've reached a certain level of maturity.

Error messages in particular are a very expansive topic, because a compiler might have the following components:

    Lexer
    Parser
    Macro Expansion
    Name Resolution
    Lowering
    Type Checking
    Privacy Checking
    Lifetime Analysis
    Code Generation
    Linking
and all of them can produce errors. If you introduce error recovery in your parser, then you'd better make sure you're carrying that information forward to name resolution and type checking. If you introduce an error type placeholder, you'd better account for it during the rest of your analysis to avoid complaining about non-existing methods on something that wasn't resolved in the first place. Error deduplication. Exploring the grammar space to identify things that a human may attempt, that aren't allowed for good reason, but that won't blow up until some quasi-random stage. Now that I think about it, you could write a whole book about these topics.
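
A minimal sketch of that error-placeholder idea (all names invented): the checker reports the unresolved name once, poisons the expression's type, and later checks stay silent on poisoned types.

    # Sketch: a poison type that prevents cascading diagnostics.
    # TyError, check_name, check_method_call and METHODS are all invented.
    TyError = "<error>"

    METHODS = {"str": {"upper": "str"}}

    def check_name(env, name, errors):
        if name not in env:
            errors.append(f"unresolved name: {name}")
            return TyError            # poison: reported exactly once
        return env[name]

    def check_method_call(recv_ty, method, errors):
        if recv_ty is TyError:
            return TyError            # stay quiet, the cause was already reported
        if method not in METHODS.get(recv_ty, {}):
            errors.append(f"no method {method} on {recv_ty}")
            return TyError
        return METHODS[recv_ty][method]

    errs = []
    check_method_call(check_name({}, "missplled", errs), "upper", errs)
    print(errs)  # ['unresolved name: missplled'] -- no bogus method error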


What is lowering?


Some languages require higher-level internal representations ("IRs") to be "lowered" into lower-level representations before code generation. In other words, high-level semantics or syntactic sugar get transformed into simpler constructs that are easier to compile.

For example, a language could support special iterator semantics that will be "lowered" into a simple for loop. Or a for loop itself may be lowered into a do-while loop, depending on the design of the lowered IR.
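
As a toy sketch of what such a pass can look like (node shapes invented), here is a `for i in range(n)` form rewritten into an init statement plus a while loop:

    # Sketch: one lowering rule, ForRange -> init + While.
    # Node shapes are invented; real IRs carry much more information.
    from dataclasses import dataclass

    @dataclass
    class ForRange:          # for <var> in range(<limit>): <body>
        var: str
        limit: str
        body: list

    @dataclass
    class While:             # while <cond>: <body>
        cond: str
        body: list

    def lower(stmt):
        if isinstance(stmt, ForRange):
            # i = 0; while i < n: body; i = i + 1
            return [
                f"{stmt.var} = 0",
                While(f"{stmt.var} < {stmt.limit}",
                      stmt.body + [f"{stmt.var} = {stmt.var} + 1"]),
            ]
        return [stmt]        # everything else passes through unchanged

    print(lower(ForRange("i", "n", ["print(i)"])))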


To give you an example, rustc has an AST, and a high-level intermediate representation tree called the HIR. The former can encode anything that is syntactically valid (and some things that aren't), while the HIR only encodes things that are valid, and transforms things like `if foo {} else {}` into `match foo { true => {} false => {} }`. That makes this internal representation more fundamental: there are fewer constructs to care about, and it allows unifying some validations that would otherwise have to be written for all the alternative user-visible constructs. It also means later stages of the compiler don't have to care about how code was written, just about this one representation.


I downvoted it because I think it's a meme that ignores where contemporary compiler engineering is happening. We don't even realize they exist anymore - V8, Graal, Roslyn, etc are all optimizing JIT compilers running in the background that can compile multiple languages on the fly. And they do a great job.

Not everything is a C++ compiler, which is inherently difficult to compile correctly. Let alone quickly.

I think the future of compilers is that everything will have an incremental parser and JIT compiler with optional AOT compilation for release builds. Hopefully LSP and DAP will take over the world of tooling integration.

There are other fundamental mistakes to look at beyond error messages and performance, like pretending that build systems don't exist and that linking multiple compilation units isn't something the compiler can do. Compiling code is getting much easier, but building it is much harder. Compilers have the most information and internal structure to handle all our build problems, but they don't do it.


> - non-cryptic error messages. Show exactly what failed, where it failed, and how to solve it.

I'm working on a compiler in my free time and I agree with you, but in defense of compilers that suck at this: error reporting is really, really hard to do right.


By the author of the excellent https://c9x.me/compile/

A lightweight alternative to LLVM: https://c9x.me/compile/doc/llvm.html


The Zephyr Abstract Syntax Description Language, Wang et al.

It has been invaluable for creating the AST (and soon the IR nodes) -- though it has been a source of much yak shaving, as I've rewritten my ASDL generator at least 3 times.

Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages Book, Terence Parr

Pretty Java/ANTLR focused, but it wasn't too hard to take the concepts and apply them to a C++-based transpiler I've been slowly poking at.

Combining Analyses, Combining Optimizations, Cliff Click

His thesis and an extension of the linked paper Global Code Motion, Global Value Numbering -- my next step down the rabbit hole...

Tree Automata Techniques and Applications, Comon et al.

Still on the fence on this one vs. the iburg/BURS algorithm, which would probably be more useful and match better with the way Parr's book goes about things.


I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.

Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing. It's not uncommon for a book or a course to dedicate 60-70% of its time to parsing, leaving almost nothing for everything else.


> I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.

Yes. There are newer editions that I can't judge, but the 1986 edition linked from the page should nowadays only be read by historians. If you want to implement a compiler, look elsewhere, for something newer.

This isn't to say that this was a bad book 35 years ago. It wasn't. But it is simply outdated. It does not contain things that are important today. It has very little treatment of type systems and type inference, for example. Its treatment of data flow analysis is based on intervals, which have limitations (though also some interesting theoretical advantages) and are used by zero compilers written in this millennium (and not even described very well in the Dragon book IIRC). And then of course there is the parsing problem, see my rant elsethread.

As for the article, I think A Catalogue of Optimizing Transformations (https://www.clear.rice.edu/comp512/Lectures/Papers/1971-alle...) would make an interesting addition.

Edit: Forgot to add that the old old Dragon book doesn't cover two topics that the article itself considers very relevant: SSA form (which I believe was discovered just around that time) and linear scan register allocation (which is much newer).


What would you recommend that covers data flow analysis better than the dragon book? I'm currently starting data flow analysis for my own compiler using intervals and/or modular congruences (operations can be reduced to congruences, then combined with the Chinese Remainder Theorem). What are the theoretical advantages of intervals and what do you recommend over that?


Ah, sorry, I should probably clarify that there are two completely distinct uses of the term "interval" in data flow analysis.

One (introduced in https://amturing.acm.org/p137-allen.pdf and discussed in the old Dragon Book) groups program points into nested "intervals", i.e., intervals are pieces of code. Analysis information is then propagated among these intervals. This is what I was referring to above. As far as I know this approach is completely obsolete by now, with everyone using methods based more or less indirectly on Kildall's worklist iteration approach (http://static.aminer.org/pdf/PDF/000/546/451/a_unified_appro...). The theoretical advantage I was referring to above was that the interval approach is AFAIR sometimes more efficient because it can take program structure (loop nesting) into account more directly.

The other meaning of "intervals" is used for an analysis domain, i.e., the kind of data you want your analysis to compute. For example, you might determine that the possible values of some variable x are in the interval [10, 20], the values of y are in the interval [5, 15], and then you can compute that the values of z = x + y must be in [15, 35]. There is nothing wrong with this approach, and any optimizing compiler is pretty much forced to use something like this. Based on your reference to modular congruences I think you probably want to do analysis of numerical values and meant this sense, which is anyway the only one in common use nowadays. Go ahead and do this. Extensions are with congruences, like "x is in [10, 20] and also congruent to 1 modulo 4" and/or "known bits", like "x is in [10, 20], and I know that (counting from the LSB) its bit 2 is 0 and bit 3 is 1".
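
The x/y/z example above fits in a few lines; a minimal sketch of the interval domain (plus the join you need where control flow merges):

    # Sketch: the interval abstract domain from the example above.
    from dataclasses import dataclass

    @dataclass
    class Interval:
        lo: int
        hi: int

        def __add__(self, other):
            # [a, b] + [c, d] = [a + c, b + d]
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def join(self, other):
            # least upper bound, used where two control-flow paths merge
            return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    x = Interval(10, 20)
    y = Interval(5, 15)
    print(x + y)       # Interval(lo=15, hi=35), matching the example
    print(x.join(y))   # Interval(lo=5, hi=20)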

Personally I learned about data flow analysis in a university course based on Principles of Program Analysis: https://www.springer.com/gp/book/9783540654100. It's a book that goes into a lot of mathematical depth but not much breadth of analyses, and it doesn't handle data flow analysis on SSA form. Then I learned additional details from various well-known papers like the ones linked above, and the classical early papers around analysis on SSA form, like https://www.cse.wustl.edu/~cytron/531Pages/f11/Resources/Pap.... I'm sure there must be a more straightforward text to get you there, but I don't think I can give a definite recommendation. I have vague memories of the coverage in Muchnick's "Advanced Compiler Design and Implementation" and/or Cooper and Torczon's "Engineering a Compiler" being reasonable, but I can't find my copies right now.


Yep, here's a thread from two years ago on exactly that: https://news.ycombinator.com/item?id=20436399


> Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing.

I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.

Also, you're far more likely to write a parser for a language than implement anything remotely related to compilers and interpreters.

It's one thing to state that some topics might justify more content, but it's absurd to downplay the fundamental importance of parsers in putting together a compiler.


The problem with the old old Dragon Book's treatment of parsing is that it explains a particular impractical algorithm (LR(1)? LR(k) maybe? it's been a while) in way too much detail. You would need that detail if you were to implement a parser generator (for that specific algorithm), or a parser using that specific algorithm, which isn't terribly well suited for generating useful error messages for your users.

Today's computers are ridiculously powerful, and the perceived performance advantages of LR parsing are simply irrelevant. Further, even if you were to implement an LR parser, you would do it with the aid of a parser generator, which hides the boring details and saves you from needing to understand the LR algorithm itself. But more likely when you write a parser you use a parser generator using a different approach, or you write a recursive descent parser by hand. In either case, the dry parts of the old old Dragon Book are completely irrelevant.

The people criticizing the 1986 Dragon Book's treatment of parsing aren't saying that parsing is irrelevant. They are saying that parsing that specific way is either irrelevant or done for you by a tool that is not the interesting part of a compiler.


If parsing "influences how the language can be structured and also how it should be structured for performance reasons" it's probably going to be a mediocre language, in which self-imposed constraints interfere with useful special features resulting in a bland design lacking competitive advantages.

Pascal has already been cited in other comments as having been designed for ease of parsing, but the Turbo Pascal compiler is an isolated masterpiece, and high-quality compilation at decent speed is normally more useful than cheap compilation at ludicrous speed.

Maybe parsing seems a large part of constructing a compiler because it's naturally the first task and because it's likely to be entangled with designing the syntax of the language, which can pose difficulties (e.g. ambiguities, context sensitivity, messy operator precedence).


> Parsing is the single most important feature of building a programming language

It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).

> It influences how the language can be structured and also how it should be structured for performance reasons.

Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi. Even if you do it with regexps. Everything that comes after parsing is very complex and is rarely explained in any detail. And everything that comes after actually directly influences the structure, the performance, etc.

It really makes zero difference if you decide that your typing info should be written like this:

   T x = new T();
or like this:

   let x: T = T()
or like this:

   var x := T()
All this is trivial to parse. However, deciding on how you implement typing, what needs to be in the language to support your choice, etc., now that definitely influences the structure, performance, trade-offs, etc.

How much of that is in the Dragon Book? Next to zero? But surely there are 4 chapters on how to parse this.


> It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).

No, not really. I mean, think about it for a second: while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?

And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?

> Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi.

Actually, you're quite wrong. RapidJSON only handles a very basic language, doesn't need to handle any weird errors or output meaningful error messages, and it boasts at most 1 GB/s.

And this is a parser optimized for speed.

Perhaps if more people had a clue about compilers, these sorts of errors and misconceptions wouldn't be so common.


> while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?

> And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?

These two statements are at odds with each other, as having a great developer experience is indeed a "fancy optimization" that you shouldn't spend too much time on when starting a project (and I say that as someone who spends their time every day thinking about making compiler errors better). Part of the reason the literature around parsers is problematic is that graceful error recovery isn't explored in detail.

> And this is a parser optimized for speed.

Parse time is never the bottleneck. It is a waste of time to focus on optimizing parse speed at the beginning. It is worth it to eventually measure and tune it, and to think about the implications when designing your grammar, but beyond that the slowness of compilers is never driven by their parsers.

> you cannot get a compiler out of the door without a parser

Technically, you can: you make your compiler something that consumes an already existing bytecode (like the JVM's or .NET's CLI), or you only implement an AST parser that isn't intended for humans to actually write while you nail down the specifics.

> Syntax/grammar errors? You need the parser for that, don't you agree?

Error recovery in a parser is a tricky thing: it is intimately tied to the choices you've made in your grammar. The more redundancy and uniqueness between scopes you have, the easier it is for your parser to identify where things have gone badly. For example, in Python, if you write

    if foo = bar:
        print(foo)
you have little information, but can still infer that the user likely wanted to compare for equality using == instead. But because Python has a very flexible grammar and semantics, the following is valid but likely not what was intended:

    class Foo:
        def bar(self):
            print("hi!")

    foo = Foo()
    foo.bar = "hi!"
If you make different, more restrictive decisions on both grammar and semantics, you can still provide tools to allow people to do this by being more verbose, while people who do it by accident can be given excellent diagnostics.


> I mean, think about it for a second: while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?

Exactly: an MVP. For an MVP, the parser may indeed be "the single most important feature of building a programming language". But even then, it almost definitely isn't.

The moment you "leave out fancy optimization features", your language becomes at most mediocre.

Because between parsing (which is trivial) and "fancy optimization features" (which come very late in a compiler's life) there are a million things that are significantly more important than parsing:

- how do you do typing?

- how do you do code generation (do you output machine code yourself or use a backend)?

- do you use an intermediate representation?

- what are the actual optimisations for performance (none of which have anything to do with parsing)? As an example, C++'s horrible header system has nothing to do with parsing, but slows compiling by orders of magnitude.

- how do you do graceful error recovery and reporting?

- how are lambdas handled? how are scopes handled? memory layouts? GC/ARC/manual memory management? standard lib?

- a million other things

> Actually, you're quite wrong.

It was hyperbole. Do you even understand how fast 1 GB/s is? It's roughly 10 to 20 million lines of code [1]. Windows 10 is roughly 50 million lines [2]. So, at 1 GB/s you can parse the entirety of the Windows 10 codebase in under 5 seconds.

Even if your parsing speed is 1000 times slower (3 orders of magnitude slower), you can still parse ~10k lines a second. And I honestly can't even begin to think how you could parse something that slowly. And even if you do, it's still not a concern, because the rest comes from outside parsing, for example:

- can you run parsers in parallel?

- do you need to parse the entirety of your project to begin compilation, or are your files/modules isolated so the compiler can work on any of them?

Parsing is trivial.

[1] For example, this SO answer for an unrelated question creates a file with 150 million lines, and the result is 6 GB. 150/6 = 25 million lines per GB; I went with even lower numbers.

[2] https://answers.microsoft.com/en-us/windows/forum/all/window...


The thing about parsers is this: we have tools that let you input any context-free grammar and will spit out a parser for you. Getting a working parser for your language is trivial, even if getting a good parser is non-trivial (though even there, I hesitate to say that it's hard).

In terms of time investment, building a parser takes like an hour whereas the semantics for the same language would take a hundred hours. Even when I throw together a parser for a simple language for some project, I spend more time trying to wrangle the output of the parser (that is, the semantics) than I do on the actual parsing portion of it. All of that effort in the Dragon book explaining how to build an LL(1) or LR(1) parser, or even how to build a lexer? Never used it. Almost completely useless. I spend more time thinking if I want to use a peek/next approach or a get/unget approach than handling shift-reduce conflicts or building first/follow sets.

Of everyone I know who works in compilers, I can't think of anyone who doesn't think that the Dragon book spends too much time on parsing.


So it is a different world now than in 1996. Parsers are much more of a solved problem now than they were back then.

I've written a production compiler, but well before the Dragon book was written. I found the book valuable, but kind of after the fact.

The references that we used included "A Compiler Generator" by David B. Wortman, Jim Horning, and William M. McKeeman. There weren't that many other parser generators available at the time.

Lots of valuable work has been done to make parsers easier to write. At MWC, dgc (David G. Conroy, author of MicroEMACS, which is now 'mg' on most systems) said that "YACC parsers make the hard part harder and the easy part easier". He wrote the C compiler with a hand-crafted parser, starting in assembler, then bootstrapped it to C. Of the folks you know that write compilers, I wonder if they started writing after 1996, or even 2006.

If you are simply a user of parsers, then maybe you don't need to know how they work.


This is an interesting comment. What's the book that's like the Dragon book that comes closest to getting it right?


> I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.

Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago. The necessary treatment of all things parsing is then covered in twenty pages of his Compiler Construction text.
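
For a sense of why twenty pages can suffice, here's a minimal recursive-descent sketch in the Wirth style (not his code; the grammar is just `expr = term {("+"|"-") term}` with numbers as terms):

    # Sketch: one function per grammar rule, Wirth-style recursive descent.
    import re

    def tokenize(src):
        return re.findall(r"\d+|[+\-]", src)

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def peek(self):
            return self.toks[self.pos] if self.pos < len(self.toks) else None

        def term(self):                      # term = NUMBER
            tok = self.peek()
            if tok is None or not tok.isdigit():
                raise SyntaxError(f"expected number, got {tok!r}")
            self.pos += 1
            return int(tok)

        def expr(self):                      # expr = term {("+"|"-") term}
            value = self.term()
            while self.peek() in ("+", "-"):
                op = self.toks[self.pos]
                self.pos += 1
                rhs = self.term()
                value = value + rhs if op == "+" else value - rhs
            return value

    print(Parser(tokenize("1 + 2 - 3")).expr())  # 0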


> Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago.

"Recursive descent" is a whole class of parsing algorithms. Claiming that recursive descent settled the question is a kin to claim that compiled programming languages settled the question of how to develop programming languages.

Moreover, even if you go as far as believing that a specific class of parsing algorithms settled anything, it's absurd to argue that people who are learning how to write a parser should not learn about them, especially as your personal choice has fundamental limitations, such as left-recursive rules and the very specific techniques needed to mitigate them.

And finally, we still see articles being accepted into academic journals on how to parse concrete languages such as JSON. I'm sure we can agree that slapping together something with a random parser generator is not something that serves the students' best interests.


It turns out that whatever limitations there are to that approach hardly matter in practice, since Wirth developed a wide variety of languages this way, and they work just fine.


> It turns out that whatever limitations there are to that approach hardly matter in practice (...)

How do you know if you do not learn about parsers?

Do you see where I am going with this?

It's absurd to claim that parsers are not fundamental and central to developing compilers. That statement is patently false, especially considering that some programming languages are notoriously challenging to parse.


> That statement is patently false, especially considering that some programming languages are notoriously challenging to parse.

Sure, if you create an artificial problem in the first place, you need an artificial solution for it. That's why careful language design is even more important than parsers.


> no longer recommend The Dragon Book as it's very outdated

Only if they didn't realize there is a second edition from 2006.


Perhaps Cooper and Torczon's "Engineering a Compiler" (2nd edition) should be among the entries in the "Books" category.


Allen Holub's 'Compiler Design in C' is not mentioned in the books. Holub's book had well-written code; I learned a lot from it.


And a free copy of this book in PDF has been made available by the author: https://holub.com/compiler/


Latest versions of the ABI specifications linked in the Machine Specific section:

ARM: https://github.com/ARM-software/abi-aa/releases

x86-64: https://gitlab.com/x86-psABIs/x86-64-ABI (go to most recent CI job and download artifacts for a compiled PDF)


Are there any good resources for building a runtime and codegen for languages with garbage collection?


The Garbage Collection Handbook is excellent, but might go into more depth than you need. For a simple garbage collector there needn’t be any effect on codegen, all you need to worry about is the runtime side of things. Search for how to build a conservative garbage collector and you should find plenty of tutorials and introductory CS courses on the subject.


For a basic GC, code generation should be unaffected. The only place you'd need to care about code generation is if you wanted to make a realtime GC, inserting read or write barriers.


My understanding from doing a bit of reading is that there are at least two things codegen-adjacent that an accurate collector will want (even if you stop the world while collecting):

1. The locations of pointers inside of allocations.

2. The locations of pointers in the stack (into which you may need to spill registers).

I can think of ways to do these (e.g. perhaps each stack frame contains a bitmask at a certain offset), but I'd love to learn what the best practices are.


Usually your objects are boxed, and the box contains a type identifier which the GC can use to identify pointer offsets. It is apparently possible to do precise garbage collection without boxes (I'll see if I can find the paper and link it later), but this is not widely done.
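
A sketch of what that looks like (layout and names invented): every heap object is a box tagged with a type id, and the collector consults a per-type table of pointer offsets when tracing.

    # Sketch: precise tracing via per-type pointer maps. The (type_id, fields)
    # box layout and the POINTER_FIELDS table are invented for the example.
    POINTER_FIELDS = {
        "Cons": (0, 1),   # car and cdr hold references
        "Int":  (),       # no references inside
    }

    def trace(obj, marked):
        type_id, fields = obj
        for i in POINTER_FIELDS[type_id]:
            child = fields[i]
            if child is not None and id(child) not in marked:
                marked.add(id(child))
                trace(child, marked)

    one = ("Int", [1])
    lst = ("Cons", [one, None])
    marked = {id(lst)}
    trace(lst, marked)
    print(len(marked))    # 2: the cons cell and the boxed int were reached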


One thing I find curious: lots of compiler writing seems to be done in Haskell, but I can hardly find any Haskell-based resources for writing compilers.

Can anyone here please recommend any such resources?


This is a pretty good tutorial that I know of: https://www.stephendiehl.com/llvm/. The author went on to write a book too, though it's still in progress http://dev.stephendiehl.com/fun/



> lots of compiler writing seems to be done in Haskell

For fun, or for the real world?


Most compilers are written for fun. That’s the point of the thread.

Also, a lot of modern language research is done through the "lens" (heh) of Haskell, or at least the Glasgow school.


Any similar resources on writing interpreters?

Also, is there a technical difference between an interpreter and a VM?


I haven't read it, but Bob Nystrom's Crafting Interpreters has been on my list for a while:

https://craftinginterpreters.com/


It's a great book. Bob's not a CS teacher, and it's great: the book is grounded in practice and has been structured in a way that lets you get a tangible result at the end of every chapter.


Don't wait, read it. It's fun and informative, worth every minute! I haven't had as much fun reading a technical book in years!


I second this. It's a great resource, very well written. Opinionated enough to have character and humour, but not so dry as to make your eyes glaze over. I guess I'm saying that for a technical tome it's very, very readable (as was his Game Programming Patterns book, incidentally).


A virtual machine executes instructions, while an interpreter takes source code as input. An interpreter could be built on top of a virtual machine, obviously, but not necessarily. For example, SICP/PLAI/OOPLAI [1] explain how to implement an interpreter on top of Scheme where you directly interpret the s-expressions. These may be worth a read if you want to learn about the usual concepts and techniques used in programming language implementations from a more general point of view, like, for example, how to implement a static type system or an object model based on classes. Interpreters based on a virtual machine are actually compilers on the fly: the source code is not interpreted directly but translated into instructions beforehand.

[1] http://sarabander.github.io/sicp/

http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04...

https://users.dcc.uchile.cl/~etanter/ooplai/
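
A toy sketch of that direct style (mine, not from those books): the s-expressions are just nested Python lists, and evaluation walks them with no bytecode anywhere.

    # Sketch: directly interpreting s-expressions represented as nested lists.
    def evaluate(expr, env):
        if isinstance(expr, str):               # variable reference
            return env[expr]
        if isinstance(expr, (int, float)):      # literal
            return expr
        op, *args = expr
        if op == "if":                          # (if cond then else)
            cond, then, alt = args
            return evaluate(then if evaluate(cond, env) else alt, env)
        fn = env[op]                            # application: evaluate args, call
        return fn(*[evaluate(a, env) for a in args])

    env = {"+": lambda a, b: a + b, "<": lambda a, b: a < b, "x": 10}
    print(evaluate(["if", ["<", "x", 20], ["+", "x", 1], 0], env))  # 11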


Check out Peter Norvig's neat BASIC Interpreter https://github.com/norvig/pytudes/blob/master/ipynb/BASIC.ip...


Interpreters usually execute by walking the abstract syntax tree at runtime, bytecode VMs usually execute actual bytecode similar to machine code. The bytecode for such VMs is often way simpler than the instruction set for an actual computer, so it's far more portable between platforms.
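
For contrast, a minimal sketch of such a bytecode loop (opcodes invented): the "machine" is an instruction array, a program counter, and a stack.

    # Sketch: a stack-based bytecode VM with invented opcodes.
    def run(bytecode):
        stack, pc = [], 0
        while pc < len(bytecode):
            op, arg = bytecode[pc]
            if op == "PUSH":                 # push a constant
                stack.append(arg)
            elif op == "ADD":                # pop two, push the sum
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "PRINT":              # pop and print
                print(stack.pop())
            pc += 1

    # What `print(1 + 2)` might compile to once the AST has been flattened:
    run([("PUSH", 1), ("PUSH", 2), ("ADD", None), ("PRINT", None)])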


Keep in mind that bytecode-interpreted languages (Ruby, Python) are typically called interpreted languages. Java is usually called "compiled" because of the explicit compilation step, in contrast to Ruby and Python, but it's essentially the same. And in discussions about the JVM, "interpretation" in Java typically refers to bytecode interpretation, as opposed to JIT compilation.

Ultimately, limiting interpreters to AST interpreters is not quite correct. The AST is just another IR that source code needs to be converted to, just like bytecode. And the AST is essentially also executed by a virtual machine. Interpretation of the IR (the AST, or bytecode, etc.) is one part of a VM. Of course, in some instances the VM-ness of a runtime is more pronounced (Java) than in others (Ruby).

The difference between interpretation and compilation is that the latter produces something meant to run on real hardware, while the former implies something that executes the IR of a program by dynamically choosing which machine instructions to run.

Of course a compiler is also something that takes in some code and outputs some other, typically more low level representation.

My point being I don't think there is a strict or consistent definition these days for what an interpreter is.

Case in point: I've also heard people say interpreters read the code "line by line" (or rather, more accurately, as far as they need to read before they know what to execute next) and execute it piece by piece. Which might be how some interpreters worked in some languages in the past. An AST interpreter, on the other hand, already implies pre-processing the whole source file. Is an AST interpreter then less of an interpreter than one that streams the source code line by line? Is a program that does that more of an interpreter than another which goes a step further and, e.g., executes a simplified AST, or one that executes a straightforward, trivial bytecode representation of the simplified AST?


> Interpreters usually execute by walking the abstract syntax tree at runtime, bytecode VMs usually execute actual bytecode similar to machine code.

This isn't the right way to separate these concepts. A VM that executes bytecodes by interpretation is also an interpreter (Python, Ruby, PHP are well-known examples). Many VMs don't interpret, or they interpret mostly during startup, but actually try to compile hot parts of the code and execute that machine code rather than interpreting (Java and JavaScript VMs are well-known examples).

The VM part more properly refers to things like data representation and memory management. Interpreters of an abstract syntax tree, interpreters of bytecodes, and just in time compilers all need to run inside a system that takes care of loading code, allocating memory when requested, and often doing garbage collection. These services are what make a VM. The exact execution strategy is not what determines VM-ness.


Thanks for clearing that up.

Which languages, or implementations of languages, directly interpret the AST without an intermediate bytecode compilation?

I know Python, Java and JavaScript (V8 and SpiderMonkey) all compile to bytecode first, probably to speed up subsequent runs and to run some optimisations on the bytecode.

What other benefits are there to compiling to bytecode first vs directly interpreting the AST?


One major benefit of compiling to bytecode first is that bytecode is a more convenient shared source of truth.

For example, SpiderMonkey has two interpreter and two compiler tiers. The output of the parser is bytecode. We can interpret it immediately, and it's a convenient input to the compilers. It also simplifies transitions between tiers: for example, when we bail out of Warp, we resume at a specific bytecode offset in the baseline interpreter.

I'm not sure how you would resume at a particular point in the execution of an AST-based interpreter without agreeing on a linearized traversal of the AST, which is 90% of the way to bytecode.


Perl is a tree-based interpreter. It's not exactly an AST anymore by the time it's being executed, but close enough.


If you compile to an AST and walk that, then your portability is at the source level; you have to send the source over the wire, and each target then needs a parser+walker and has to parse and walk. If you compile to bytecode, you can send bytecode over the wire and then simply interpret that bytecode.


Portability. Say I wanted to make language X run on all platforms, but I didn't actually care about compiling it on all platforms. I can just write a relatively simple VM for each platform. This is one of the reasons Java was, and still kinda is, so ubiquitous.


Wouldn't writing an interpreter for each platform be less work and achieve the same goal as writing a VM for each platform?

Edit: ^Aside from being able to execute the bytecode on any platform


Why would it be less work? The interpreter will need to implement whatever operations a VM can perform, so a priori it's at least as much work. Bonus: if you can bootstrap the source-to-bytecode process, then you only need to write (and compile) that once to get a full-fledged interpreter on every host with a VM.


As others mentioned, source code would have to be distributed that way, and I think creating a simple VM is easier than a simple language parser. But of course, an optimizing one can be really quite complex in both cases.


Gambit Scheme, CLISP, CMUCL are capable of interpreting ASTs, and I believe (although I'm not 100% sure) that this is the case for Lispworks and Allegro CL as well. Also, Ruby until version 1.8.


I've toyed with compiler stuff for years. My current idea is a fluent/piping language. Is there any reference that would help with that?


There is a more recent edition of "Compilers: Principles, Techniques, and Tools" worth considering: https://www.amazon.com/Compilers-Principles-Techniques-Tools....


Does anybody know of good introductory resources on error recovery techniques when writing a parser & interpreter? Bonus points if they use combinators instead of a hand-written parser.


Not a resource, but depending on what kind of parser you are doing, maybe an idea.

The thing I'm going to try out next for one of my batch parsers (written in recursive descent style, as are many or most parsers in the wild) is to have AST nodes of kind "invalid". The idea is that the parser should always run to completion, with minimal in-line error checking, even where I have, say, "require_token_kind()" or similar calls in my parser code.

Maybe not even an "invalid" kind, but an additional "invalid" flag, so that even those invalid nodes can still have all the regular node-kind-dependent structure on them.
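
A sketch of that flag-based shape (all names invented): on a mismatch, require_token_kind records the error, taints the node, synthesizes a token so the structure stays regular, and parsing continues.

    # Sketch of the "invalid flag" idea; every name here is invented.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str
        children: list = field(default_factory=list)
        invalid: bool = False        # regular structure kept, just tainted

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos, self.errors = tokens, 0, []

        def require_token_kind(self, kind, node):
            if self.pos < len(self.toks) and self.toks[self.pos][0] == kind:
                tok = self.toks[self.pos]
                self.pos += 1
                return tok
            self.errors.append(f"expected {kind} at token {self.pos}")
            node.invalid = True      # later passes skip or soften checks here
            return (kind, None)      # synthesized token keeps the shape regular

        def assignment(self):        # assign = NAME "=" NUMBER
            node = Node("assign")
            node.children.append(self.require_token_kind("name", node))
            self.require_token_kind("equals", node)
            node.children.append(self.require_token_kind("number", node))
            return node

    p = Parser([("name", "x"), ("equals", "="), ("string", '"hi"')])
    tree = p.assignment()
    print(tree.invalid, p.errors)    # True ['expected number at token 2']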

The other obvious idea is to use a parser generator. But I'm pretty sure I wouldn't be happy on that route - from the little experiments that I've done, I infer it's too bureaucratic, too much boilerplate for me to suffer.

The more general advice for error handling is that you should try to make APIs that require very little in-line error handling. Instead, keep errors behind the APIs, maybe collect them in a list, or only keep the first or last error, and allow to handle them at a later point (out-of-line).

And boom, most of the talk about smart syntax and type systems and other clever systems to make error handling more convenient suddenly becomes obsolete.

Yet more abstractly, don't try to build systems that make cumbersome things more convenient, but think how you can do with less of these cumbersome things. (That works wonders for runtime performance, too).


>The more general advice for error handling is that you should try to make APIs that require very little in-line error handling. Instead, keep errors behind the APIs, maybe collect them in a list, or only keep the first or last error, and allow to handle them at a later point (out-of-line).

I guess you wanted to say that instead of just printing what's wrong at some point, one should be smarter and build some handier error handling, like attaching the error to an AST node and then maybe doing something with it later.


The important point is the temporal decoupling between the time where the error is detected and the time where it is handled.


I've actually been meaning to write something up about this. I've found you can get fairly good error messages from a parser-combinator-style parser just by saving the deepest location you successfully parsed to, and which parser you failed in.
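
A minimal sketch of that trick (combinator names invented): each failure updates a high-water mark, and the error message comes from the deepest one.

    # Sketch: parser combinators that track the furthest failure point.
    # Parsers take (text, pos) and return (value, new_pos) or None.
    furthest = {"pos": 0, "expected": None}

    def lit(s):
        def parse(text, pos):
            if text.startswith(s, pos):
                return s, pos + len(s)
            if pos >= furthest["pos"]:           # deepest failure wins
                furthest.update(pos=pos, expected=repr(s))
            return None
        return parse

    def seq(*parsers):
        def parse(text, pos):
            values = []
            for p in parsers:
                result = p(text, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
            return values, pos
        return parse

    let_stmt = seq(lit("let "), lit("x"), lit(" = "), lit("1"))
    if let_stmt("let x := 1", 0) is None:
        print(f"syntax error at {furthest['pos']}: expected {furthest['expected']}")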


Wish there were more resources about or tools for bidirectional parsing.


ProTips after writing numerous kinds of compilers in university:

Recursive descent with backtracking is much easier to construct and maintain for diagnostics than bison/flex.

Derivative parsing is very interesting.



