I think the incremental approach is terrific, because it allows you to get to a program that emits assembly and builds a working binary in week one. The first thing this does is give a concrete example of what "a compiler" is. The second is that it provides a great foundation for discussing static vs. dynamic and what decisions are made at compilation time vs. runtime, without needing a full implementation. These concepts are not obvious (e.g. when can and should a check for unbound ids happen? What about divide by zero, or overflow, or type mismatch?), and deserve to be carefully taught and considered.
This lets the course build up a new feature, front to back, each week or two, and consider its implications for the whole pipeline each time.
I find it a little amusing that it claims "toy compilers are too simplistic and do not prepare the novice compiler writer to construct a useful compiler", and then chooses Scheme, effectively bypassing the whole process of parsing... which IMHO is half the fun of writing a compiler.
Seeing the elegance of recursive descent (and refactoring it into table-driven operator-precedence) and how it can turn recursive structures like expressions and statement blocks into linear sequences of instructions is something I think everyone writing a compiler should experience.
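To make that concrete, here's a rough sketch (my own, not from the paper or any particular course) of the table-driven style: a precedence-climbing expression parser in OCaml, where one small precedence table replaces a recursive function per precedence level, and the linear instruction sequence falls straight out of the recursion. The token type and toy stack-machine output are assumptions for illustration.

```ocaml
(* Sketch: precedence climbing over a token list. The prec function is
   the whole "table"; adding an operator means adding a row, not a new
   recursive function. *)
type token = Num of int | Op of char

let prec = function '+' | '-' -> 1 | '*' | '/' -> 2 | _ -> 0

(* parse one operand, then fold in operators of at least min_prec *)
let rec parse_expr min_prec tokens instrs =
  match tokens with
  | Num n :: rest -> climb min_prec rest (Printf.sprintf "push %d" n :: instrs)
  | _ -> failwith "expected a number"

and climb min_prec tokens instrs =
  match tokens with
  | Op c :: rest when prec c >= min_prec ->
      (* the right operand binds tighter; the operator is emitted after it *)
      let tokens, instrs = parse_expr (prec c + 1) rest instrs in
      climb min_prec tokens (Printf.sprintf "op %c" c :: instrs)
  | _ -> (tokens, instrs)

(* "1 + 2 * 3" comes out flat: push 1; push 2; push 3; op *; op + *)
let compile tokens = List.rev (snd (parse_expr 1 tokens []))
```

The recursive structure of the expression is still there, but only in the call stack; what comes out is already linear.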
However, I agree with the incremental approach, and it is nothing new; personally I think Jack Crenshaw's tutorial series (started in 1988) is one of the best.
It's interesting to note that Jack doesn't specialise in compilers, nor in computer science for that matter, yet his explanations are far clearer to me than those of many who do.
> how it can turn recursive structures like expressions and statement blocks into linear sequences of instructions
I think this is one of the key concepts in a compiler, yet I don't think it has much to do with parsing.
Parsing (in a compiler) is the act of taking a linear sequence of characters or tokens and turning it into a rich AST structure.
The back end of the compiler is what takes this recursive structure and flattens it out into a linear sequence of instructions that "mean" the same thing as the tree.
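As a rough sketch of that split (the types and instruction names here are mine, purely for illustration): the front half produces the recursive type, and the back half is little more than a post-order walk.

```ocaml
(* The parser's output: a recursive structure *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* The back end's output: a flat sequence *)
type instr = Push of int | AddOp | MulOp

(* Post-order walk: each subtree's instructions run first, leaving its
   value on the stack; the operator then combines the two results. *)
let rec flatten = function
  | Num n -> [Push n]
  | Add (l, r) -> flatten l @ flatten r @ [AddOp]
  | Mul (l, r) -> flatten l @ flatten r @ [MulOp]

(* (1 + 2) * 3 flattens to: Push 1; Push 2; AddOp; Push 3; MulOp *)
let _ = flatten (Mul (Add (Num 1, Num 2), Num 3))
```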
Parsing is critical, of course, for building a working compiler. But I think not focusing on surface syntax gets to the key recursive-to-linear insight you mention more directly!
This is not a necessary step, and is specifically not done in a lot of simpler compilers. E.g. most "Wirth-type" compilers never build an AST.
There are lots of good reasons to build ASTs if you want to do complex analysis and optimisation, and it simplifies the presentation of some things, but calling the code generator module straight from the parser in many ways makes the direct connection clearer.
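A minimal sketch of that Wirth shape (the toy grammar and all names are mine): the parser never builds a tree, it just calls the generator's emit routines the moment it recognizes a construct.

```ocaml
(* Grammar: expr := term { "+" term }, where terms are integer literals.
   No AST; code generation happens during the parse itself. *)
type token = Num of int | Plus

(* the "code generator module": emit routines for a toy stack machine *)
let gen_const n = Printf.printf "  push %d\n" n
let gen_add () = Printf.printf "  pop ebx\n  pop eax\n  add eax, ebx\n  push eax\n"

let parse_term = function
  | Num n :: rest -> gen_const n; rest
  | _ -> failwith "expected a number"

let rec parse_rest = function
  | Plus :: rest ->
      let rest = parse_term rest in
      gen_add ();
      parse_rest rest
  | rest -> rest

(* parse_expr [Num 1; Plus; Num 2; Plus; Num 3] prints the instructions
   for 1 + 2 + 3 in a single left-to-right pass *)
let parse_expr tokens = parse_rest (parse_term tokens)
```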
> effectively bypassing the whole process of parsing... which IMHO is half the fun of writing a compiler
Fun for compiler writers, perhaps ;) I am a big fan of languages with minimal syntax because complicated grammar affects the user as well. The machine may do the hard work of translating the source code, but the user still has to do some degree of translation of their own in their head while reading it.
I view source code as a kind of interface, and I like simple, clean interfaces.
>> It's interesting to note that Jack doesn't specialise in compilers, nor in computer science for that matter, yet his explanations are far clearer to me than those of many who do.
Yes. It's written with a practical approach in mind. Once you get that idea, feel free to study compiler textbooks like "the Dragon Book", for example, for in-depth theory.
"Abdulaziz proposed an incremental approach in which the students build the compiler in stages; they start by implementing a complete compiler for a very small subset of the input language, then in each subsequent stage they add a feature to the input language and add or modify passes to handle the new feature [...] This is the textbook for the incremental version of the compiler course at Indiana University"
Thanks, this is a terrific reference that I hadn't seen!
I see a few interesting contrasts with my course off the bat:
- I skipped having students uniquify variables early on. I wanted to get from source to assembly as quickly as possible, but this may have been a worthwhile step to force, since it teaches some valuable lessons.
- I had them implement ANF (AKA flatten in those notes) in a different style, because I was used to it. The style in that textbook, where an expression is turned into a list of bindings and a final expression rather than using a continuation-passing ANF algorithm as I did, is better (Ranjit Jhala at UC San Diego also pointed this out when he taught a version of the course); there's a rough sketch of that style just after this list.
- The linked book has instruction selection with semi-abstract addresses followed by a "patching" phase, to avoid instructions that, say, move from stack location to stack location. This is cool, but is a few more steps than I wanted to get into for the simplest compilers. I didn't pursue as structured an approach, and instead gave a simple description of the first few "compile" functions we wrote:
Generate a list of instructions that gets the "answer"
for this expression into EAX
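For the second bullet, a rough sketch of the "list of bindings plus final expression" style (the AST and helper names are mine, not the textbook's): anf returns the bindings an expression needs together with an atomic expression for its value, with no continuation threaded through the algorithm.

```ocaml
type expr = Num of int | Var of string | Add of expr * expr

(* fresh temporaries for the bindings we introduce *)
let fresh =
  let n = ref 0 in
  fun () -> incr n; Printf.sprintf "tmp%d" !n

(* returns (bindings in evaluation order, atomic result expression) *)
let rec anf = function
  | Num n -> ([], Num n)
  | Var x -> ([], Var x)
  | Add (l, r) ->
      let bl, al = anf l in
      let br, ar = anf r in
      let t = fresh () in
      (bl @ br @ [ (t, Add (al, ar)) ], Var t)

(* (1 + 2) + 3 yields bindings [tmp1 = 1 + 2; tmp2 = tmp1 + 3]
   and final expression tmp2 *)
let _ = anf (Add (Add (Num 1, Num 2), Num 3))
```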
That simple description avoided detailed discussion of instruction selection, abstract vs. concrete addresses, etc., in the beginning, and just had students generate some instructions that work. Quite quickly after that we needed to start talking about clobbering other expressions' state and finding unique locations for variables, but those just became constraints on "getting the answer into EAX" correctly. In fact, I found that repeating the mantra "get the answer into EAX" was a nicely actionable way to go about generating instructions, and it let students get started when we introduced a new feature (to compile it, we need to get the answer into EAX!).
After we'd done this a few times, we had the shared vocabulary to realize that answers don't _always_ have to go to EAX (the compiler can be parameterized over where the current answer should go), and not every constant's value needs to flow through EAX in order to get to its variable's home, etc. But these were refinements on top of our dead simple strategy.
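A sketch of how that might look (the toy language, stack-offset environment, and instruction strings are all my assumptions, not the course's actual code), with the destination already parameterized as in the refinement above:

```ocaml
type expr = Num of int | Var of string | Add1 of expr

(* env maps variables to stack offsets; the week-one story is just
   compile ~dest:"eax" [] program *)
let rec compile ~dest env = function
  | Num n -> [Printf.sprintf "mov %s, %d" dest n]
  | Var x ->
      (* an unbound id would fail right here, at compile time *)
      let offset = List.assoc x env in
      [Printf.sprintf "mov %s, [esp - %d]" dest offset]
  | Add1 e ->
      (* get the sub-expression's answer into dest, then bump it *)
      compile ~dest env e @ [Printf.sprintf "add %s, 1" dest]

let _ = compile ~dest:"eax" [ ("x", 4) ] (Add1 (Var "x"))
```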
Just some thoughts I had while perusing the first few chapters here. Thanks again for sharing!
Is this preceding work to the tour de force that is the Chez Scheme compiler, with its something like 40+ incremental passes? I mean, Cisco bought it, probably to hide it from us all. It must be worth something.
As someone pointed out, yeah, it was open-sourced after being bought. There used to be two variations: the compiled one, Chez Scheme, and an interpreted one, Petite Chez Scheme, with none of the benefits of the compiler design but still fast and well designed in relation to other Schemes. The latter was freeware; the former was neither free/open nor freeware.
It was never clear why Cisco bought it. Rumors persist ...
I've just realised that I have confused this paper with another paper, which I think was from the Racket people and involved what they called "nano-passes". I think I've actually glossed over this paper many times for this reason.
I'm not interested in writing a Scheme compiler (there are several good ones out there already), but this looks like it has a lot of useful techniques I can pilfer.
https://www.cs.swarthmore.edu/~jpolitz/cs75/s16/index.html