I am 40 now; back in 2004-2006, for two years I was one of the youngest professors in Italy, teaching "compilers and programming languages".
I still feel so fortunate to have had that experience.
To help my students avoid the considerable cost of the dragon book(s), I created a condensed version of the parts that were required for the course - and I didn't charge anything for it, unlike what usually happens pretty much everywhere in Italy. It was available in PDF and OpenOffice formats, on the website that I created for the course (yes, the CS department didn't really have a proper website to use as a CMS - I kid you not).
You can find the material here; it's in Italian, but it might be fun to take a quick look: http://www.lulu.com/shop/simone-brunozzi/dispense-lab-lingua...
Such good memories.
I was a T.A. for COMP-520, Compiler Design, at McGill University for two years, and it was the best job I've ever had. The compiler project had not been updated in 10-12 years; my adviser trusted me to make a new, interesting project. I was paid by the department for 5 hours of work per week, but I poured in more than 35 per week to design, document, and implement a subset of Go.
My hard work paid off immediately. Students were more interested in a language that was new, modern, and used in industry than in the previously used language, Wig. As a result, enrollment jumped from 12-14 students to 45. Though it was a subset, and much of what makes Go interesting was removed (concurrency, interfaces, GC), it was still an extremely demanding project. Many times I thought that the project was too big and that we needed to cut back, but the students were real troopers and chugged through (by the way, when did 20-year-olds become so smart?! I was never this bright at their age!), and all teams completed the project and made me very proud.
Greece. I learned a lot in the process, having just an undergraduate degree at the time. The students didn't appreciate it much, unfortunately (it wasn't a university but a vocational school; don't ask why Tanenbaum was on the curriculum). Just one came to me after the exam (I had to pass them all anyway) and congratulated me for making it so hard for them (him).
Ah, it was online until about 2007 or 2008. You might be able to find it with the wayback machine (it was something like www.wedoit.us/labcompilatori). I might have it in my old backups, I will check.
BTW, can anyone please suggest a good online compiler course for creating your own programming language, with labs/exams/certs? I can find pretty good courses for almost anything but not for compilers... Something that would focus on handling grammars (LALR, LL, CYK, etc.), ASTs, semantics, types, code completion, imperative/OOP/functional/logical/self-modifying constructs, etc., or even NLP-to-AST conversion...
My compilers course has all the lectures recorded, and has starter code for all the assignments provided. It provides all the parsers, and focuses on the backend, including memory management and various details of value representations.
These are just my materials; I don't have an associated platform with verification/certification. However, some folks from HN have previously done the course and emailed me after, and seem to have gotten some value out of it.
I've been using the earlier offering of your course and another[0] to learn about compilers for a while now, and it's the only method that has really clicked for me. Thank you very much for making this open!
Matt Might doesn't have an organized course for free online, but I found that his blog filled in a lot of the gaps for me between compiling a language like C, and a more modern language like Scheme or Python, which are missing from other resources I've found online. You'll have to click around to different posts, but here's a good starting point:
Most of this stuff flew over my head. While I understand C, this stuff feels beyond my grasp. Can anybody please suggest recommended reading for beginners?
Pet peeve about compiler education that is not repeated by this article (yay, well done!): the way you have to wade through months of parsing bullshit to get to the really interesting stuff (optimization and static analysis).
"the way you have to wade through months of parsing bullshit to get to the really interesting stuff (optimization and static analysis)."
If you look back in the late 70's, you'll discover something interesting: Almost everything was parsing of one form or another.
Really.
Parsing of flowgraphs, etc.
Dataflow was done not just by bitvector methods (which were fairly new), but by parsing of graphs using graph grammars.
Most optimization analyses are very closely related to parsing (and can often be solved by graph reachability). The complexity bounds of most of them are also driven by parsing complexity results.
For example, Andersen-style points-to analysis is solvable as a context-free-language problem. Its time bounds are driven by the time bounds of context-free-language parsing - in particular, all-pairs reachability on context-free languages, which is O(N^3).
etc
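To make the points-to example concrete, here's a minimal Python sketch of Andersen-style analysis in its equivalent set-constraint form (a worklist over inclusion constraints; the CFL-reachability encoding gives the same answers). Variable names are illustrative, and load/store constraints - the part that actually drives the cubic bound - are omitted for brevity:

    from collections import defaultdict

    def andersen(addr_of, copies):
        # addr_of: (p, x) pairs for "p = &x"; copies: (p, q) pairs for "p = q"
        pts = defaultdict(set)    # variable -> points-to set
        succ = defaultdict(set)   # q -> {p}, meaning pts[q] flows into pts[p]
        for p, q in copies:
            succ[q].add(p)
        work = []
        for p, x in addr_of:
            pts[p].add(x)
            work.append(p)
        while work:               # propagate until a fixed point is reached
            q = work.pop()
            for p in succ[q]:
                if not pts[q] <= pts[p]:
                    pts[p] |= pts[q]
                    work.append(p)
        return pts

    # p = &a; q = &b; r = p; p = q  =>  both p and r point to {a, b}
    print(dict(andersen(addr_of=[("p", "a"), ("q", "b")],
                        copies=[("r", "p"), ("p", "q")])))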
Parsing is the basis of compilers for a very good reason. The fact that we have layered other kinds of methods on top of it for optimization and analysis does not change the fact that, at their core, a lot of them reduce to parsing of graphs and graph reachability.
Another way to look at it: If you ever want to be able to be really good at compiler optimization, don't skip parsing. Otherwise you will miss the connections that are really important to being able to design efficient new algorithms, instead of just learning old ones.
Parsing also comes up in a stupidly large number of other areas within programming, including: serialization formats, network protocols, DSLs, log analysis, time series interpretation, NLP, unstructured data mining, and probably several I'm not familiar with. Fundamentally understanding what's going on when you're doing recursive-descent, shift/reduce, or combinator-based parsing unlocks a big toolbox that you can use to solve a number of other problems.
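For anyone who hasn't seen it, here's what the recursive-descent style boils down to - a toy Python parser for arithmetic, one method per grammar rule (a sketch, not production code):

    import re

    def tokenize(src):
        return re.findall(r"\d+|[()+\-*/]", src)

    class Parser:
        # expr   -> term (('+'|'-') term)*
        # term   -> factor (('*'|'/') factor)*
        # factor -> NUMBER | '(' expr ')'
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0
        def peek(self):
            return self.toks[self.pos] if self.pos < len(self.toks) else None
        def eat(self, expected=None):
            t = self.peek()
            if t is None or (expected is not None and t != expected):
                raise SyntaxError(f"expected {expected!r}, got {t!r}")
            self.pos += 1
            return t
        def expr(self):
            node = self.term()
            while self.peek() in ("+", "-"):
                node = (self.eat(), node, self.term())
            return node
        def term(self):
            node = self.factor()
            while self.peek() in ("*", "/"):
                node = (self.eat(), node, self.factor())
            return node
        def factor(self):
            if self.peek() == "(":
                self.eat("(")
                node = self.expr()
                self.eat(")")
                return node
            t = self.eat()
            if not t.isdigit():
                raise SyntaxError(f"expected a number, got {t!r}")
            return ("num", int(t))

    print(Parser(tokenize("2*(3+4)")).expr())
    # ('*', ('num', 2), ('+', ('num', 3), ('num', 4)))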
I'm not saying it isn't useful - I'm saying it's not interesting (and I'm talking about the act of parsing, not the algorithms as applied elsewhere). Think about this: you start learning compilers, you spend months on parsing, and after that what do you have? Nothing - you "compiled" stuff into an AST. Yawn.
"I'm not saying it isn't useful - I'm saying it's not interesting "
Then you are doing it wrong :)
"(and I'm talking about the act of parsing, not the algorithms as applied elsewhere)"
I'm not sure what this is supposed to mean. It sounds like you are trying to separate parsing text into ASTs and parsing flowgraphs, so you can claim one is boring, and the other is "algorithms as applied elsewhere". There is no such separation.
We just happen to name variables a little differently, so it doesn't quite look like "figure out if this one parenthesis matches this other one". But it is the same algorithm.
If you are trying to define parsing as only the thing that produces an AST - that is not the only thing that is parsing. Yes, some books colloquially separate the phases like that, but it's silly.
"Think about this: you start learning compilers, you spend months on parsing, and after that what do you have? Nothing - you "compiled" stuff into an AST. "
You are definitely doing it wrong. Take the same grammar parsing you taught someone in order to make an AST: you can now say "we can apply a slightly different language grammar to the AST and generate machine code", or "apply a slightly different language grammar to the AST and parse out flowgraphs and statements"
etc
Parsing isn't just taking a bunch of text and turning it into an AST.
The fact that a ton of people think this is one reason there are so many crappy little compilers that never go anywhere :)
We're having two different conversations here. You are saying (correct me if I'm wrong) that the algorithms you use for parsing are broadly applicable to program analysis. You are not wrong. It's just orthogonal to my point, which is that program analysis is really interesting.
To make an analogy (from Lockhart's Lament: https://www.maa.org/external_archive/devlin/LockhartsLament....), you are saying that color theory is massively important throughout one's career as an artist. I am saying that I just like painting things.
Sure, there's lots of cool shit in parsing theory, no doubt, with tons of broad applications. But the way we are taught it ("here's how we get an AST from written text") turns away people who are into program analysis. As such, we shouldn't consider parsing a prerequisite for program analysis, regardless of how interesting or applicable the theory of parsing is.
I think DannyBee's using a slightly broader definition of parsing than you're used to. There's the narrow definition of parsing ("Here's how we go from a sequence of bytes to an AST") that's the concrete material taught in compiler courses. But there's also a much broader paradigm where "parsing" is broadened to "Here's how we go from a listed set of patterns, possibly overlapping, to a set of transformations defined by those patterns." The source input need not be a linear set of bytes; you can trivially extend it to trees by taking an in-order traversal of the tree, or even operate directly on tree or graph data structures.
With the broader definition, a large number of what you consider the interesting parts of program analysis are actually parsing problems. Instruction selection, for example, involves tiling the AST (or more typically, the IR) with a minimal set of machine instructions. In many compilers, this is implemented with a machine-definition file that describes a grammar for transforming the IR into machine-language instructions. Peephole optimization involves recognizing certain patterns (eg. division by a constant power of 2) and then transforming them into more efficient constructs (eg. shift-right). This can also be described by a grammar on the IR.
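To illustrate the peephole case, here's a toy Python pass over a made-up three-address IR that rewrites multiplication/division by a power of two into shifts (note the caveat in the comment: signed division isn't a plain shift, so real compilers emit a fix-up sequence there):

    def is_power_of_two(n):
        return isinstance(n, int) and n > 0 and n & (n - 1) == 0

    def peephole(ir):
        # ir: list of (op, dst, src_a, src_b) tuples in a toy IR
        out = []
        for op, dst, a, b in ir:
            if op == "mul" and is_power_of_two(b):
                out.append(("shl", dst, a, b.bit_length() - 1))
            elif op == "div" and is_power_of_two(b):
                # correct for unsigned values only; signed division by a
                # power of two needs a fix-up for negative operands
                out.append(("shr", dst, a, b.bit_length() - 1))
            else:
                out.append((op, dst, a, b))
        return out

    print(peephole([("mul", "t0", "x", 8), ("div", "t1", "y", 4)]))
    # [('shl', 't0', 'x', 3), ('shr', 't1', 'y', 2)]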
I'm not sure if this is explicitly taught in compiler courses - it wasn't really in mine, but it was alluded to - but it's a really powerful way of thinking of the passes in a compiler. It lets you unify concepts, so that instead of memorizing all the specific algorithms used, you can think of a pass as "this is a transformation from this specific input language to this specific output language".
As much as I enjoyed optimization and static analysis, parsing is probably the single most useful knowledge I got from my formal CS education; and probably the only part of a compiler that most CS students will ever need to know.
Yes, I think understanding that there's a fundamental difference between parsing a language and "naive call-and-for-loop programming" is very important. Most naive efforts to parse a language starting with no knowledge simply fail (and that can include, say, efforts to validate a file name in a somewhat complex file system, or to parse a packet in a network protocol).
In college, it took me a bit of time to understand what a formal language really was and what the difference between a parser and a simple string processor was; when I arrived in industry, I noticed this was something a certain proportion of otherwise experienced programmers simply did not understand.
The symbiotic relationship between compiler development and hardware architecture design has a defining role in CS problems, and while you don't need to understand that relationship to write CRUD apps for a living, I think ignorance of it robs those who really want to understand this stuff of truths both technical and profound.
There are alternatives to this mainstream style of computer architecture, such as dataflow or hybrid quantum-classical systems, which both require a very different kind of compiler.
A recursive-descent parser is probably the only part of a compiler which people will easily derive on their own, even without taking a compiler course.
Funnily enough, it's also where a bunch of mature compilers eventually ended up, because of speed, heuristics, and so forth. Like GCC, which used to have a bison-based parser IIRC, but is now a hand-rolled recursive-descent one...
It's not so much for speed and heuristics but for error reporting and/or recovery.
As it turns out (and I've tried many times myself too), designing a parser generator that will retain a compact representation without losing good error reporting is very hard.
E.g. consider a BNF-type grammar. Almost every symbol represents a potential error. But deriving that error message automatically, in a way that is helpful to a human, is often incredibly hard, because the grammar lacks information about which mistakes humans are more likely to make.
So we end up with bad error reporting, or you end up having to write code to analyse errors anyway, and then a lot of the benefits disappear - the actual parsing is usually easy to do compactly with a recursive descent parser and a tiny library of helpers anyway.
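A toy Python fragment (hypothetical parser, illustrative names) of why the hand-written version wins here: the parser knows it is inside an argument list, so it can say so, whereas a generated parser typically only knows that a grammar symbol failed to match:

    def parse_call(toks, i):
        # parses e.g. ["f", "(", "x", ",", "y", ")"] starting at index i
        name = toks[i]; i += 1
        assert toks[i] == "("
        open_pos = i; i += 1
        args = []
        while toks[i] != ")":
            args.append(toks[i]); i += 1
            if toks[i] == ",":
                i += 1
            elif toks[i] != ")":
                # a generator would typically just say: unexpected token
                raise SyntaxError(
                    f"missing ',' or ')' in argument list of '{name}' "
                    f"(opened at token {open_pos})")
        return (name, args), i + 1

    print(parse_call(["f", "(", "x", ",", "y", ")"], 0))
    # (('f', ['x', 'y']), 6)
    # On ["f", "(", "x", "y", ")"] it raises:
    # SyntaxError: missing ',' or ')' in argument list of 'f' (opened at token 1)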
Yea, I was imprecisely using "heuristics" to mean primarily error reporting. You essentially have to accept a wider grammar to be able to throw more intelligent errors.
But speed, IME, is also a major issue. GCC sped up a good bit when switching, and we are seeing speed issues with bison-generated parsers.
Ironically, one of the benefits of bison is that you get feedback about whether your grammar is conflict-free. That can be a significant advantage over a hand-built recursive-descent parser for a complex language like SQL.
Speed certainly can be a nice bonus, but you can also generate recursive descent parsers fairly easily if you have a suitable representation. My own parser generator attempts did that.
The feedback certainly is useful, and I'm all for people using such tools to validate grammars for testing. Though people manage to abuse that by writing extra code too (Ruby being a prime example of using a parser generator and then abusing the hell out of it in ways that complicate the grammar).
I really wish more work would go into research on automating error reporting, though. As much as I handwrite most of my parsers now, I wish I didn't have a reason to.
> I really wish more work would go into research on automating error reporting, though. As much as I handwrite most of my parsers now, I wish I didn't have a reason to.
I recently made myself sit down and learn leex (essentially lex for Erlang), and I found that to be an extremely useful exercise. It helped me understand how to think about what kind of inputs a parser can make use of, and when it makes sense to skip the tokenizing step.
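For anyone curious what the generated-lexer idea boils down to, here's a minimal rule-table tokenizer sketched in Python (illustrative rules; first matching rule wins):

    import re

    RULES = [(r"\s+", None),            # whitespace: match but emit nothing
             (r"\d+", "NUMBER"),
             (r"[A-Za-z_]\w*", "NAME"),
             (r"[()+\-*/=]", "OP")]

    def tokenize(src):
        pos, out = 0, []
        while pos < len(src):
            for pattern, kind in RULES:
                m = re.match(pattern, src[pos:])
                if m:
                    if kind:
                        out.append((kind, m.group()))
                    pos += m.end()
                    break
            else:
                raise SyntaxError(f"bad character {src[pos]!r} at {pos}")
        return out

    print(tokenize("x = 2 + 30"))
    # [('NAME', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', '+'), ('NUMBER', '30')]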
I know it might be a contrived case, but it's rather disappointing to see that even with the -O2 optimisation level, the final generated code contains 43% (3/7) useless instructions to mess around with RBP. I mean, it should be bleeding obvious to the compiler that this function doesn't require any stack allocation, much less an RBP-based one (the tradeoff between using RBP/RSP-based allocation is far more subtle, and I can forgive a compiler for choosing nonoptimally in that case), so why did it emit those useless instructions?
There are plenty of articles about how intelligent compiler optimisers can be, and I don't doubt that they can do some pretty advanced transformations, having read a lot of compiler output; and then I see things like this, which just make you go why!?!? It's puzzling how compilers behave, applying very sophisticated optimisation techniques but then missing the easy cases. The effect might not be as great as that of more advanced optimisations, but neither is the effort required to do this type of optimisation. This is not even low-hanging fruit; it's sitting on the ground.
> However, the compiler will use as few registers as possible.
Compilers usually try to use all the registers they can (although sometimes they do exhibit odd non-optimal behaviour, as described above); eschewing register use and using memory instead will increase code size and time.
The "useless" instructions aren't for potential stack allocation, they're there so backtraces work. If you don't care about that then you can add -fomit-frame-pointer and they disappear.
Compilers don't default to this because having backtraces on crashes and profiling are more useful than the like 1% speed gain from another available register on x86-64.
Are you arguing that they do work on ELF+DWARF with sufficient extra debug info, when running under a debugger, despite documentation that explicitly says it doesn't work? Because I certainly don't expect my users to have a debugger installed (let alone regularly run programs under one) but I'd still like to get useful crashlogs automatically.
Or are you saying that you could potentially reconstruct a backtrace, if someone wrote code to do so that is smart enough to locate and account for all the stack adjustments in reconstructing where each function stored its return address, including runtime alloca() / VLA adjustments and multiple adjustments in the same function (potentially needing to reconstruct the code flow), just from the disassembly? Because the linked MIPS code is nowhere near that smart, and to my and that author's knowledge, such code has not been written for x86.
Either way, nice in theory but not good enough in practice.
The debugging info is enough to find, for each code address, the offset of the return address on the stack; from that return address you can repeat the process, etc. You don't need a debugger - saving the stack is enough to get a backtrace for a crashed process.
You can have a minified memory dump sent automatically, then symbolicate it on your servers using the stripped debugging symbols. That's what Breakpad does. I'm not sure how well it deals with alloca(), but I've used it on several shipped pieces of C++ software and it worked mostly fine.
Cheers to you for not just trying to force your own creation down people's throats. D is a cool language, but you're right, it's a bit complex for a beginner.
A Java compiler lacks optimization and code generation (generating IR is not code generation). So does a Javascript one. But they are both great languages for learning how lexing, parsing, and semantic analysis work, because those operations are relatively simple.
C is a harder one because the multiple "phases of translation" add quite a bit of complexity. It's easy to get lost in the frustrating and pointless complexity of the preprocessor. C++ layers onto that the mixing of semantic analysis up with parsing, and an unusually complex semantic analysis at that. Not for beginners.
I suggest reading the source code to a professional compiler because there are books about how compilers work, and then there's how compilers really work. Pro compilers tend to be built differently from academic or side project compilers.
I don't understand why this still looks scary to me. A while back, after learning regex (though only at a basic level), I thought: maybe I can make my own custom C preprocessor as an exercise. Perfect, right? I get to choose the syntax AND the rules. But somehow I just can't write it. Nested ifdefs, defines, and undefs mixing together were just too much.
Reading through this, it seems like there's a huge amount of work that goes into real compilers. Front end, optimizer, backend, etc. I do appreciate it more, but as useful as compilers are, maybe I'll just leave it to the pros (:
Don't ever do that! It really isn't that complicated at a basic level. Regex aren't the solution for parsing. What you're missing is the recursive nature of "context free languages"[1]. Formal language theory [2] was one of the big, early wins of Computer Science. There's a lot to it, but writing a simple recursive descent parser doesn't require any of it.
The big caveat is that parsing is only about 1/3 of the problem; the others, as described in this article, are optimization and code generation. And of these, optimization can be skipped entirely, leaving only code generation, which can be naively done by walking the AST, as in the sketch below. If this all seems foreign to you, there's a lot of really good info online nowadays - people love writing about this subject.
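To show how little the naive version needs, here's a Python sketch that walks tuple-shaped expression ASTs - e.g. ('+', left, right) and ('num', n) - and emits code for a toy stack machine (illustrative only):

    def codegen(node, out):
        # post-order walk: emit code for the children, then the node itself
        op = node[0]
        if op == "num":
            out.append(("PUSH", node[1]))
        else:
            codegen(node[1], out)
            codegen(node[2], out)
            out.append(({"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op],))
        return out

    print(codegen(("*", ("num", 2), ("+", ("num", 3), ("num", 4))), []))
    # [('PUSH', 2), ('PUSH', 3), ('PUSH', 4), ('ADD',), ('MUL',)]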
Understanding this subject is very important in my opinion, this is one of the foundational things we do. Attaining a comfort level with compiler engineering is one of the two or three things anyone can do to really "level up" as a software developer. Some other things are writing multi-threaded servers in C, and 3D software rasterization. Again, my opinion here.
>Understanding this subject is very important in my opinion, this is one of the foundational things we do. Attaining a comfort level with compiler engineering is one of the two or three things anyone can do to really "level up" as a software developer.
I think it depends on the level of abstraction.
Then again, I'm not really a software developer, most of the work I do is scripting stuff for data analysis, which is where I learned regex from. I did learn C in college which I somehow got fascinated with, but I never really do any serious work with it. Still, over time, reading about quite lower level stuff (compared to what I do) does seem like it helps take the mystery out of things like unintuitive behaviors with multiple references to a Python list (I suspect it was pointers all along). Taking it to the next lower level with studying compilers and how assembly gets generated doesn't seem like it will benefit me much more.
Still, I do like C enough to possibly consider doing something in it for fun if I get an idea, so maybe one day I'll come back to this.
Wow, I enjoyed both the content and the design aesthetic of this post. I guess that's what happens when a designer turns software engineer? Please post more like this!
Compiler design depends a lot on the language being compiled though, this would be a typical design for a C-style language.
When doing Forth-style languages, for example, I've found that it's not very helpful to parse to trees, since I can't tell the structure of function calls before seeing the actual running stack anyway. Which in turn affects the compiler, since all it has to work with is a sequential stream of tokens.
And not all languages are stand-alone, or compile all the way down to machine code. Compiling to byte-code for running immediately, with the backup of a runtime library, is an attractive option for more specialized languages; and this means that it's possible to carry more information straight over from compilation to runtime.
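A toy Python sketch of what that looks like in practice: with no tree, "compiling" a Forth-style program is a single pass over the token stream (real Forths also handle word definitions, immediate words, and much more):

    OPS = {"+": "ADD", "*": "MUL", "dup": "DUP"}

    def compile_forth(src):
        # each word maps directly to one bytecode instruction
        code = []
        for word in src.split():
            if word.lstrip("-").isdigit():
                code.append(("PUSH", int(word)))
            else:
                code.append((OPS[word],))
        return code

    def run(code):
        stack = []
        for ins in code:
            if ins[0] == "PUSH": stack.append(ins[1])
            elif ins[0] == "ADD": b, a = stack.pop(), stack.pop(); stack.append(a + b)
            elif ins[0] == "MUL": b, a = stack.pop(), stack.pop(); stack.append(a * b)
            elif ins[0] == "DUP": stack.append(stack[-1])
        return stack

    print(run(compile_forth("2 3 + dup *")))   # [25]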
I've been working on a scripting language that raised some of these questions lately:
We've rehashed the debate over the utility of the dragon book in the 21st century so many times on HN that I'm not going to comment on its usefulness. (Search if interested.)
But, for those interested, I highly recommend Engineering a Compiler by Cooper and Torczon OR Modern Compiler Implementation in ML by Appel. I've read all three and would only recommend the dragon book to someone very interested in the subject.
The Monica Lam sections in the second edition of the Dragon Book are out of this world good. Also, she's an excellent lecturer.
Muchnick's book is still quite good.
A second semester compiler class is rarely taught and so everyone tries to squeeze too much into one semester. I almost wish they broke it in two and allowed students to take either half. LL(1) parsing isn't a prerequisite to static single assignment.
I love reading about compilers. It's fascinating. I have implemented my own PL/0 compiler in various languages as a way to learn those languages. It's fun without having to think about designing a new language.
Question: Does anyone know of some resources for adding concurrency to a simple stack based VM? I've been wanting to look into that for a while just for kicks.
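Not a course, but one common starting point is cooperative green threads: give each VM thread its own stack and instruction pointer, and let a scheduler interleave them. A toy Python sketch with a made-up opcode set:

    from collections import deque

    class Thread:
        def __init__(self, code):
            self.code, self.ip, self.stack = code, 0, []

    def step(t):
        op, *args = t.code[t.ip]
        t.ip += 1
        if op == "PUSH": t.stack.append(args[0])
        elif op == "ADD": b, a = t.stack.pop(), t.stack.pop(); t.stack.append(a + b)
        elif op == "PRINT": print(t.stack.pop())

    def run(threads, quantum=2):
        ready = deque(threads)
        while ready:
            t = ready.popleft()
            for _ in range(quantum):          # run one time slice
                if t.ip >= len(t.code):
                    break
                step(t)
            if t.ip < len(t.code):            # not done yet: requeue it
                ready.append(t)

    run([Thread([("PUSH", 1), ("PUSH", 2), ("ADD",), ("PRINT",)]),
         Thread([("PUSH", 10), ("PRINT",)])])   # prints 10, then 3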
> Some compilers translate source code into another programming language. These compilers are called source-to-source translators or transpilers
I was wondering how compile-to-JS languages would be classified. Technically they are all transpilers, but seeing as there is no lower-level for the browser than JS (barring WebASM), one could also argue that they are in fact also compilers.
> Technically they are all transpilers, but seeing as there is no lower-level for the browser than JS (barring WebASM), one could also argue that they are in fact also compilers.
They're compilers because they translate from one computer language into another. It has nothing to do with how low of a level the browser supports running. They're additionally transpilers (a subset of compilers), because the languages that they translate between are both high-level languages.
Although the following (from Wikipedia) seems to accord with usage:
> The name compiler is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language, object code, or machine code) to create an executable program.
Maybe we should complicate that a bit by saying that a compiler typically translates a programming language into a target language where the higher-level features of the source language must be "expanded" or "linearized" in some substantial way to be expressed.
So, for example, GHCJS is more like a compiler than a transpiler, even though its target language is JavaScript, because the JavaScript that it generates is full of graph reduction stuff that is implicit in the source Haskell.
On the other hand, CoffeeScript is more like a transpiler, because it doesn't do any such substantial expansion or linearization.
Transpiler is a recently made up word, and describes something that's just as well described as a compiler. I wouldn't spend too much time hoping for a distinction here.
I'd thought a transpiler is a compiler where the source and target language are the same and the semantics of the program are preserved across transpilation (e.g. Closure Compiler, Babel, LLVM optimization passes). That's a useful distinction: it implies that you can feed the output of a transpiler back into its input and at worst the result will be idempotent, and that you can mix transpiled output with raw input and the result will still be interpretable.
Evidently Wikipedia has a more broad definition of source-to-source compilation, which I'd agree is pretty meaningless. The distinction between high- and low-level languages is pretty arbitrary; you can write machine-language "source code", so in some ways all compilers are source-to-source.
I think they've gone out of fashion now, but when I learned about compilers we were introduced to T-diagrams, sometimes called tombstone or Bratman diagrams, and shown how to bootstrap a compiler for a high-level language from scratch [1].
Torben Mogensen's Basics of Compiler Design [2] - Chapter 13 is devoted to bootstrapping a compiler, which introduces them.
[1] scratch meaning a computer (with relevant hardware) with a CPU, its machine-code instructions, and some way to make files containing those instructions. You can do this even without an assembler; that's the first step. Of course no one would do this now, but it was a necessary step in developing the first high-level languages and compilers.
Can anyone help me understand what kinds of employment are available in the market that require deep knowledge of compilers?
I love this subject, but I don't know which domains it is applied to outside of compiling languages like Javascript/Java/D/C/Rust and so on into machine code.
Thanks, I liked this blog post. I have questions: Why should I prefer LLVM? What is LLVM, and how is it different from GCC? I understand every compiler has three parts: front end, optimizer, and back end. Looks simple.
I haven't looked at the GCC code myself, but what I heard is that you can see that it's a codebase that grew massively over a long timespan, and things that should be separated are not. At least a few years ago there were actually two different IR formats (the newer one being based on SSA) which were converted back and forth several times. I think GCC's IR does not have a canonical text representation; LLVM's does, and that is really helpful for understanding and debugging.
Still, LLVM also has its weird parts, like its variable naming scheme.
This was really good. I could imagine it being required background reading for a lab on the compilation pipeline in 61C at Berkeley. Add in a section on linking.