Writing a C Compiler, Part 1 (norasandler.com)
369 points by sidcool on Dec 1, 2017 | 60 comments



Most compiler tutorials show how to write a compiler, but not how to write an optimising compiler. In fact, it's not that much harder, and it's a nice walkthrough of some very cool ideas in graph theory, dataflow analysis, etc.

Which is why I've been writing a mini-LLVM in Haskell as a tutorial to explain optimising compiler construction: Tiny optimising compiler (https://github.com/bollu/tiny-optimising-compiler)


There is something to be said for having LOTS and LOTS of non optimizing C compilers written in many languages. Even languages like, say, Python. And plenty of C compilers written in C. A virtue of being non-optimizing is that they are easier to study and verify correct operation.

What we need is for the source code of the smaller number of optimizing compilers to be compilable by the many, many different inefficient C compilers.

We need to be sure that the binary of the optimizing compiler is not compromised by the compiler which compiled the optimizing compiler.

You could compile compilers, including various compilers written in C. At some point you compile the wanted optimizing compiler's source using different compilers produced through various chains of compilation. Those binaries of the optimizing compiler may all differ, but the code generated by those multiple binary versions of the optimizing compiler should be identical.

Finally, use the uncompromised optimizing compiler to compile itself. In fact, use the multiple binaries of the optimizing compiler to compile itself and check that they all produce the same final binary.

Cross compilation doesn't hurt either. For example, a Raspberry Pi compiles the optimizing compiler for x86-64. The binary of that compiler should still match the same compiler compiled by other C compilers.

Now we not only have to be paranoid about the binary of a compiler being compromised by another compiler, but by the Intel Management Engine. Compromise baked right into the hardware.

Every paranoid thing I thought ten years ago turned out to be true.


See Ken Thompson's award speech where he talks about the original C compiler backdoor:

http://wiki.c2.com/?TheKenThompsonHack

http://vxer.org/lib/pdf/Reflections%20on%20Trusting%20Trust....


See David Wheeler's work on "Diverse Double Compiling". What the parent describes seems to be similar.

https://arxiv.org/abs/1004.5534


> Every paranoid thing I thought ten years ago turned out to be true.

That seems like confirmation bias. But yeah, good idea about compiler integrity.


You want verifiable builds, but you're asking that different algorithms produce the same result.

Is this even possible?

And wouldn't it be easier to just verify the binary manually?


I've not been able to confidently follow what the parent wrote, but if they're discussing diverse double compiling, the way it works is that you compile the same source code (S) with a bunch of different compilers (C0...CN). This gives you a bunch of executables (E0...EN) which are the compilation of S. Because C0...CN are different compilers, E0...EN will probably differ, bitwise. But because behavior should depend only on the source code, if those compilers are sufficiently correct E0...EN should behave the same. So if you compile S with E0...EN to get (say) E0'...EN', those should all be bitwise identical.


That's right. Details, demos, and formal proofs about diverse double compiling (DDC) can be found here on my web page: https://www.dwheeler.com/trusting-trust/


You said it way better than I did.

Also, doing cross-compilation on different platforms should produce identical binaries, or at least identically behaving binaries.


A way to do this more directly, at a deeper level, and without dealing with any assembly syntax is to compile down to an emulation of a straightforward machine (I really like the MSP430 for this), preferably an emulator you write yourself.

Having had the pleasure of writing a couple emulators and a couple compilers at this point: the emulator is easier than the compiler, by a lot. Relative to parsing and emitting code, emulating a straightforward architecture is easy.

This is a great post!


I agree completely, and I would add that the compiler doesn't even need to compile down to a system/architecture that actually exists. You can make your own simple 'architecture' with a minimal list of instructions that wouldn't require much more than a loop and some switch statements to emulate, and you're free to design it in any way that makes your life easier. For an architecture that's designed to be targeted by C, you likely don't even need registers and could just use memory operands for everything.


There's also a reasonable chance that if you're interested in compiler design and making an architecture you're the sort of person who would enjoy learning a hardware description language. With a cheap FPGA dev board you could then implement your toy architecture and run your compiled programs on that. This adds substantial difficulty to the project, but is damn fun.


Yet another approach which I have used in Fur (https://github.com/kerkeslager/fur) is to compile to a low(er)-level programming language (in the case of Fur, I'm compiling to C).

These approaches aren't just for toy languages. Java/C#/Python standard implementations compile to byte code that runs on virtual machines. Glasgow Haskell Compiler compiles to C--, and a bunch of languages compile to JavaScript.


The approach taken in my compiler course is one that I consider the best for toy compilers, but it does mean having to deal with assembly.

Do the generation to some kind of byte code as you are suggesting, but use an instruction format that can be used as macro calls in macro assemblers.

Then just write the macros for each bytecode; it doesn't matter if the register usage is bad, the goal is just to have a plain native executable that runs.
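
For illustration, a sketch of the emitting side in C. The VM_PUSH/VM_ADD mnemonics are invented for this example; the corresponding macros would be written once, by hand, in your assembler's macro syntax (e.g. VM_ADD expanding into a pop-pop-add-push sequence):

  #include <stdio.h>

  /* The code generator only knows the bytecode mnemonics; the
     macro assembler expands each one into real instructions. */
  void gen_push(FILE *out, int k) { fprintf(out, "    VM_PUSH %d\n", k); }
  void gen_add(FILE *out)         { fprintf(out, "    VM_ADD\n"); }

  /* Compiling "1 + 2" then just emits three macro calls. */
  void gen_example(FILE *out) {
      gen_push(out, 1);
      gen_push(out, 2);
      gen_add(out);
  }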


> is to compile down to an emulation of a straightforward machine (I really like MSP430 for this)

What exactly do you mean by this? Direct object code emission?

> the emulator is easier than the compiler, by a lot

Yeah, definite +1 there. For most simple machines it really is not much more than a loop with a switch on the opcode where the cases update the CPU state.
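
To make that concrete, here is a minimal sketch of such a loop in C, for a made-up accumulator machine (the opcodes and encoding are invented for illustration, not any real ISA):

  /* One accumulator, one byte per opcode, operands are absolute
     addresses into the same 256-byte memory. */
  enum { OP_HALT, OP_LOAD, OP_ADD, OP_STORE, OP_JMP };

  void run(unsigned char mem[256]) {
      unsigned pc = 0, acc = 0;
      for (;;) {
          switch (mem[pc++]) {          /* fetch and decode */
          case OP_HALT:  return;
          case OP_LOAD:  acc  = mem[mem[pc++]]; break;
          case OP_ADD:   acc += mem[mem[pc++]]; break;
          case OP_STORE: mem[mem[pc++]] = acc;  break;
          case OP_JMP:   pc   = mem[pc];        break;
          }
      }
  }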


Yes: just spit out opcodes.


Any reference(s) where I can read more about writing emulators?


Here is one I found quickly: https://fms.komkon.org/EMUL8/HOWTO.html

Writing an emulator is almost as much fun as writing a compiler. And when you are done, you will have another level of appreciation of how computers work.


Wirth came up with a simple RISC architecture for his compiler book. There's an emulator[1] for it, written by Peter De Wachter. Michael Schierl adapted Peter's emulator code into Java and JS. You can run it here:

http://schierlm.github.io/OberonEmulator/

I sent a bunch of patches to rework the in-browser emulator a couple weeks ago. If you don't know C, I recommend reading through the JS emulator's source. (View source should suffice—it's all unminified vanilla JS; there's no opinionated JS framework involved.)

With it running in the browser, the emulator frontend treats the web platform as its widget toolkit. The code to interface with that is in webdriver.js[2] and takes about 1000 lines of code. The CPU and memory operations themselves are implemented in risc.js[3] and only take about 1/3 that.

To follow along with instruction fetching/decoding/execution, you'll need to understand the ISA. There's a good 3-page overview linked from projectoberon.com under the title "RISC Architecture"[4]. A more in-depth description of the design is also available[5].

I have some tentative work on a machine-code level debugger online[6]. It's unfinished, so it comes with no documentation and the toolbar icons are missing. (There are tooltips, though.) But you can play with it if you feel like watching the registers and flags change while stepping through machine instructions.

1. https://github.com/pdewacht/oberon-risc-emu/

2. https://github.com/schierlm/OberonEmulator/blob/master/JS/we...

3. https://github.com/schierlm/OberonEmulator/blob/master/JS/ri...

4. https://www.inf.ethz.ch/personal/wirth/FPGA-relatedWork/RISC...

5. https://www.inf.ethz.ch/personal/wirth/FPGA-relatedWork/RISC...

6. https://www.colbyrussell.com/staging/aubergine/emu.html?imag...


I really like Ghuloum’s approach: you start by compiling a tiny, trivial subset of your source language all the way down to x86 assembly. Then you add new language features, one step at a time. In step one, you just return constants; in a later step you handle addition and subtraction; and so on. Every step is small enough to feel manageable, and at the end of every step you have a working compiler.
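
For a sense of scale, step one's entire code generator can be a few lines. Here is a sketch in C (assuming the source is already parsed down to the one constant, and using x86 AT&T syntax):

  #include <stdio.h>

  /* Step one: the whole language is "int main() { return <constant>; }".
     emit_program(stdout, 2) emits assembly for "return 2". */
  void emit_program(FILE *out, int constant) {
      fprintf(out, "    .globl main\n");
      fprintf(out, "main:\n");
      fprintf(out, "    movl $%d, %%eax\n", constant);
      fprintf(out, "    ret\n");
  }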

That approach sounds like the same one in Jack Crenshaw's very lucid tutorial:

https://compilers.iecc.com/crenshaw/

He uses Pascal to write a Pascal compiler, but the basic steps seem to be the same --- start with a very simple lexer, add pieces to it incrementally, then write a recursive-descent parser on top of that and gradually extend it to handle more of the language.

Another simplification that really helps, once you're used to recursive descent and have noticed that the functions for each operator/precedence level are very similar, is to refactor them into a lookup table and an even simpler function that uses it, creating "precedence climbing":

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm

That makes it very easy to add/change/extend operators, or even have them be user-defined(!) and dynamically modifiable.
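
A minimal self-contained sketch of precedence climbing in C (evaluating directly instead of building an AST, over single-digit operands, to keep it short):

  #include <stdio.h>

  const char *p;                    /* input cursor */

  int prec(char op) {               /* the lookup table */
      switch (op) {
      case '+': case '-': return 1;
      case '*': case '/': return 2;
      default:            return 0; /* not a binary operator */
      }
  }

  int parse_primary(void) { return *p++ - '0'; }

  int parse_expr(int min_prec) {
      int lhs = parse_primary();
      while (prec(*p) >= min_prec) {
          char op = *p++;
          /* prec(op) + 1 makes the operators left-associative */
          int rhs = parse_expr(prec(op) + 1);
          switch (op) {
          case '+': lhs += rhs; break;
          case '-': lhs -= rhs; break;
          case '*': lhs *= rhs; break;
          case '/': lhs /= rhs; break;
          }
      }
      return lhs;
  }

  int main(void) {
      p = "1+2*3-4";
      printf("%d\n", parse_expr(1)); /* prints 3 */
  }

Adding or re-prioritising an operator is then just a change to prec().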

As a demonstration of how absolutely tiny a compiler can be, this one is worth inspecting carefully --- it also uses a handwritten lexer and recursive-descent parser, but doesn't make an AST and just emits code for a stack VM: https://news.ycombinator.com/item?id=8558822

...and someone modified that one to make an x86 JIT: https://news.ycombinator.com/item?id=8746054


I remember reading Jack Crenshaw's "let's build a compiler" tutorial in the 90's. Though it's pretty outdated, I can definitely recommend it to everyone interested in compiler construction, because it's written from the perspective of, well, a compiler creator.


At uni I went through a module on compilers which covered basically all of the Dragon book.

I then followed this online course where you actually write a compiler for the COOL language. It has been one of the most educational things I've ever done in my life and I can't help recommending it.

https://lagunita.stanford.edu/courses/Engineering/Compilers/...

I'd love to do something like that in my professional career, but unfortunately 90% of the time I've ended up carrying out way less interesting work (writing applications for end users).


There's no better advertisement for the utility of learning to write a compiler than this introduction from the excellent "Engineering a Compiler" by Cooper and Torczon:

Why Study Compiler Construction?

A compiler is a large, complex program. Compilers often include hundreds of thousands, if not millions, of lines of code, organized into multiple subsystems and components. The various parts of a compiler interact in complex ways. Design decisions made for one part of the compiler have important ramifications for other parts. Thus, the design and implementation of a compiler is a substantial exercise in software engineering.

A good compiler contains a microcosm of computer science. It makes practical use of greedy algorithms (register allocation), heuristic search techniques (list scheduling), graph algorithms (dead-code elimination), dynamic programming (instruction selection), finite automata and push-down automata (scanning and parsing), and fixed-point algorithms (data-flow analysis). It deals with problems such as dynamic allocation, synchronization, naming, locality, memory hierarchy management, and pipeline scheduling. Few software systems bring together as many complex and diverse components. Working inside a compiler provides practical experience in software engineering that is hard to obtain with smaller, less intricate systems.


This is a terrific quote.

Having written a production compiler in a previous lifetime, and having talked with developers over the years since then, the difference that knowledge makes in their day-to-day programming is clear to me.


"A compiler is a large, complex program"

Writing a compiler isn’t much harder than writing an interpreter, if at all.

Assuming your language can be parsed without running it (Perl, for instance, cannot, in general. See https://www.perlmonks.org/?node_id=663393), it is just a matter of doing that and emitting code for every statement, plus some bookkeeping in order to fix up branches. If the language you are compiling is simple, that’s not hard to do.

The result will not be the fastest possible, but it will be compiled code, and it will be faster (sometimes significantly so) than interpreting the program.
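
The "bookkeeping to fix up branches" is mostly backpatching. A sketch in C (the bytecode format here is invented for illustration):

  #include <stdint.h>

  enum { OP_JMP_FALSE = 1 /* ...other opcodes... */ };

  uint8_t code[4096];
  int here = 0;                     /* next free byte */

  /* Emit a conditional jump with a placeholder target and
     remember where the target bytes live. */
  int emit_jump_if_false(void) {
      code[here++] = OP_JMP_FALSE;
      int patch_at = here;
      here += 2;                    /* room for a 16-bit target */
      return patch_at;
  }

  /* Once the jump's destination is known, fill it in. */
  void patch_jump(int patch_at) {
      code[patch_at]     = here & 0xff;       /* little-endian */
      code[patch_at + 1] = (here >> 8) & 0xff;
  }

  /* Compiling "if (cond) body": emit cond, j = emit_jump_if_false(),
     emit body, patch_jump(j). */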


Are you trying to dispute that compilers are large and complex? It seems like you are saying that if the language is simple, a non-optimizing compiler for it can be simple, which seems somewhat obvious. I think the "microcosm of computer science" and associated complexity become apparent when the source language is not simple, the middle end contains many optimization passes, and the back end requires more than simple statement emission.


Yes. The categorical statement “compilers are large and complex” is incorrect, and IMO is detrimental to getting people interested in the field.

We don’t say “Web applications are large and complex”, either.

Just as you can start simple with a web application and, if needed, grow it, you can start with a simple compiler and, if needed, grow it.

(And, on current hardware, I think the complexity of the language matters less than it used to. Single-pass languages such as Pascal are still easier targets, but computers have plenty of RAM, so fixing things up is easier than it used to be.)


Writing a trivial and inefficient compiler is indeed not that complex, especially if you choose a simple language. Writing a smart optimizing compiler must be a pretty different kettle of fish.


And a good, optimizing interpreter is also "a large, complex program". Both can be fairly easy if implemented naively, but gain complexity quickly.


It is unclear if Perl is actually a language.


Can't wait for the next parts of this!

I think this incremental approach must be the way to do it.

I was inspired by 8cc to write my own C compiler; but I ended up burnt out on that project. (The standard document is just infuriating. There's so much ambiguity and disorganization in the document, and the actual technical specifications are like an afterthought to the novel of text.)

May be time to get back into it!


Check out http://www.craftinginterpreters.com for something more in-depth.


Love it. This is exactly how I wrote compilers for a living back in the day – I always got a minimal end-to-end demo running first, because it was always the best way to avoid ugly surprises.


To someone who knows more about the subject - am I correct that actually encoding/parsing the language is a small part of creating a compiler compared to AST optimization?

Optimizing e.g. C to generate much better assembly (especially using some of the more non-standard instructions) seems like a much more daunting task.


As others have said, yes, that's mostly the case.

Even parsing C and elaborating it to a correctly typed AST is not quite as simple as others are making it out to be, though. Getting all the implicit type conversions correct is not completely trivial, and is a popular source of bugs (see some examples in http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf for instance).

There are also some annoying ambiguities in the grammar and in name resolution rules that mean that sometimes it's not easy to tell whether something is supposed to be a type name or a variable name: https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf

C is a "simple" language in many senses of the word, but it has a lot of complex details. It's fine to gloss over them when writing a compiler for a language very similar to C for learning purposes, but getting everything just right for actual C is tough.
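
A classic example of the implicit-conversion subtleties (the usual arithmetic conversions at work):

  #include <stdio.h>

  int main(void) {
      unsigned int u = 1;
      int i = -1;
      /* The usual arithmetic conversions turn i into unsigned,
         so -1 becomes UINT_MAX and the comparison is false. */
      if (u > i)
          printf("1 > -1, as you'd expect\n");
      else
          printf("surprise: 1 > -1 is false here\n");
      return 0;
  }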


Yes, stuff like the `a * b` ambiguity (declaration or expression?), typedef being just another storage class specifier (thus `static int var` and `typedef int mytype` being the exact same syntactic form), the complex rules for implicit promotions and conversions, and the mess between signed and unsigned.

All of these design warts of C show up clearly when attempting to write a compiler and are not very obvious to most users of the language.
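
The `a * b` point, concretely (names hypothetical): whether the same tokens form a declaration or an expression depends on what the parser knows about the first identifier at that point, which is why C parsers traditionally feed typedef names back into the lexer:

  typedef int T;

  void f(void) {
      T * b;      /* declaration: b is a pointer to int */
  }

  void g(int T) {
      int b = 2;
      T * b;      /* expression: the parameter T shadows the
                     typedef, so this is "T times b" (discarded) */
  }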


Function pointer declarations are also a fun one, especially in a function declaration as return value and parameters.


Typedefs are a little trickier to handle (particularly the "when do they actually take effect" part), but "the `a * b` ambiguity" is not hard, because if 'a' is a typedef or qualifier/specifier or basically anything that a declaration can start with, it's a declaration.

C++, on the other hand, is definitely far harder to parse, especially if you include things like templates.


Lexing, parsing and generating the AST are comparatively straightforward and a very small part of a real optimising compiler.

The hard stuff is everything related to the code generation – register allocation, instruction selection and scheduling, dealing with any peculiarities of the target ISA and microarchitecture, etc. Your goal here is to preserve the semantics of the HLL constructs while emitting the optimal sequence of instructions for the CPU to execute.


It depends on your source and target. The two parts are practically different disciplines. If you target llvm or naive assembly as output, most of your work will be in translating semantics, and depending on the language complexity, there can be a lot to implement. Or not that much, in simple languages like C.

The back end is mostly about transforms that preserve meaning, with a long tail of attempts to prove things about the code to make things faster in certain cases.

You can also have languages with complex front ends that need lots of work, whether it's preserving runtime type information (a big serialisation job that's not strictly back end work), interesting type systems that also are trying to prove things about the code, features that require increasingly elaborate transformations before they can be handed off to any common back end (like closures, iterators, async, continuation passing style).


Absolutely. That's why LLVM exists - you write the front end (the comparatively easy part), emit non-optimised LLVM IR, and LLVM does the hard optimisations.


I find the assumption that program control requires boolean values (of any kind) fascinating, since there are at least two languages that use success/failure as the controlling mechanism - Snobol (in all its incarnations) and its follow-up language Icon/Unicon.

Neither has the concept of booleans; instead they use the idea that a computation can succeed and return a result, or fail and return no result. This makes certain kinds of expressions that normally require booleans and the associated logical operators simpler and clearer, as below:

  if a < b && b < c && c < d then ...

compared with

  if a < b < c < d then ...

Which is easier to mentally parse?


I really liked how Icon handles function success/failure separately from return values. It seems very close to Go's error handling conventions, but with a little more magic. Wikipedia gives the following Icon example code:

  if a := read() then write(a)
which would translate into something like the following Go code:

  a, err := read()
  if err == nil {
      write(a)
  }


Simplify this to

  write(a := read())

where write will not be executed if read fails, nor will a have any value assigned to it. Failure is an option and not an error.

So to copy standard input to standard output, we do

  every write(read())

At the end of file, the read fails, the write fails and then the every fails and on you go.


Anyone who wants to write C compilers has two good books to choose from:

"A Retargetable C Compiler: Design and Implementation", by David R. Hanson and Christopher W. Fraser

"Compiler Design in C", by Allen I. Holub

A bit old, but there have been no significant changes to C since they were written, other than optimizers taking advantage of UB.
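
An illustrative example of the kind of UB-driven optimization that post-dates those books:

  /* Modern compilers assume signed overflow cannot happen, so at
     -O2 this whole function is typically folded to "return 1". */
  int always_true(int x) {
      return x + 1 > x;
  }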


I read that as "A Regrettable C Compiler"


> You also need to decide whether to write your own parser and lexer or use automatic parser and scanner generators (e.g. flex and bison).

Really? There are much better choices available in 2017. ANTLR 4 leaps to mind, which has been in development since 1989.


ANTLR is really only good for the Java ecosystem. The quality of the generated C/C++ code is low and it has a large number of dependencies. C++ development also lagged pretty far behind Java when ANTLR 4 was released.

I don't know of any "real" language written in C or C++ that uses ANTLR.

The state of the art for production languages in C/C++ is either hand-written parsers or yacc/bison. I've never even seen flex used, and I've looked at over 20 lexer/parser implementations for "real" languages.


Java, C#, Go, Python, JavaScript, Swift and C++, although everything gets started with Java and filled out to the other targets.

You may be thinking of ANTLR 3. I really didn't much care for the ANTLR 3 C target, but I use ANTLR 4 and its C target. It is really much, much better than ANTLR 3. Terence Parr says ANTLR 4 is the parser generator he always wanted to make.

However, I'm still not pitching ANTLR as a parser for a real compiler. ANTLR rules the world for little languages and domain-specific languages, but a production language would almost certainly use a hand-crafted recursive descent parser.

This is a learning project, so I'd strongly recommend anyone learn modern tools and methods. Bison is C/C++ native with experimental Java support and nothing else. Flex is C and C++ only. Not a fan.


Another good one is JavaCC, though not at the same level as ANTLR.



The "recursive descent parsing" part is not very well explained or thorough. It doesn't have a good example...

I still enjoyed reading it though.

EDIT:

I've read the recursive descent parsing article on Wikipedia, and I'm amazed how the C code example is missing parts... Kinda sad.


I would be more interested in learning how to write a transpiler...

As long as I can compile a language to C, it removes a lot of work.


That is just a compiler whose backend generates code in another language instead of machine code.


Well I find it more interesting...

Also C compilers can do a lot of micro optimizations that are not trivial, so compiling something to C or maybe LLIR seems like a better choice.


That is usually a fallacy; it only looks simple as a first step.

The first issue is the impedance mismatch between the original language semantics and what is possible to represent in C.

Two examples are tail calls and continuations, which are hard to get right at the C level, requiring either compiler extensions, longjmp/setjmp tricks, or a little bit of assembly.
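
One of the standard workarounds for tail calls is a trampoline; a sketch in C (all names here are illustrative, and the casts between function pointers and void * are a common but not strictly portable idiom): instead of emitting a direct call in tail position, the generated code returns the next function to run, and a driver loop makes the "calls" in constant stack space.

  typedef void *(*thunk)(void);   /* each step returns the next step */

  void trampoline(thunk f) {
      while (f)
          f = (thunk)f();
  }

  /* A million "tail calls" without growing the stack: */
  static int n = 1000000;
  static void *step(void) { return n-- > 0 ? (void *)step : 0; }
  /* ... trampoline(step); */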

Then there is the whole issue with undefined behaviour in C, introducing bugs that didn't exist on the original code.

Even if the generated code is clean today regarding UB, you cannot ensure that a new C compiler version won't be introducing such issues.

Having said this, there are languages like Eiffel and Nim, whose main backend relies on C compilers, instead of compiling directly to pure native code.

Skipping Objective-C and C++ here, because although they started as C pre-processors, their code was meant to be an extension of C, and so their semantics also imply C semantics.

Compiling to some kind of IR, like LLIR is more sane.

Then you can do whatever you want with it, generate code in another programming language, machine code, or a plain interpreter for bootstrapping purposes.


I agree with most of your points, but...

> Even if the generated code is clean today regarding UB, you cannot ensure that a new C compiler version won't be introducing such issues.

This is not correct. Undefined behaviour is specified in the standard, and a good compiler-to-C will always generate UB-free code.


The standard specifies all known cases; the list is not exhaustive across all possible C implementations.


I am sorry but I don't understand... The standard specifies what is UB, and the implementations must behave accordingly to be standard-compliant.

If your code avoids UB as specified by the standard, it will run the same way when compiled with any standard-compliant compiler.


Compiler optimization is a tricky business, where every vendor does all they can to beat the benchmarks game.

What gets documented in the standard is what the compiler vendors who bother to sit in ANSI C meetings agree to document as such.

By no means does the standard forbid compiler vendors from taking advantage of situations not yet documented as such.

Eventually new cases of UB might get added to the standard.

The current set of about 200 scenarios of UB in C11 weren't all there in C89.

Another example is that you can have code that is perfectly valid and UB-free, but thanks to PGO and code-rewriting rules in the optimizer stage, it gets rewritten with UB side effects at a later stage in the optimizer pipeline.



