Let's Build a Compiler

userbinator · on July 16, 2019

This looks like something similar to Crenshaw's excellent tutorial of the same name:

https://compilers.iecc.com/crenshaw/

and its x86 port: https://github.com/lotabout/Let-s-build-a-compiler

As interesting as Lisp-family languages are, I still think it's better to use something with more traditional syntax and start with parsing, because that both reaches a much wider audience and gives a very early introduction to thinking recursively --- the latter being particularly important for understanding of the process in general. A simple expression evaluator, that you can later turn into a JIT and then a compiler, is always a good first exercise.
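To make the "simple expression evaluator" idea concrete, here is a minimal recursive-descent sketch in Python. It is not taken from Crenshaw's tutorial or from the linked series; the grammar, the class name `Parser`, and the helper `evaluate` are assumptions chosen for illustration. Each grammar rule becomes one method, and the methods call each other recursively.

```python
import re

# One token is either a run of digits or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """Hypothetical evaluator: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := factor (('*' | '/') factor)*   (integer division)
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):
        # factor := NUMBER | '(' expr ')' | '-' factor
        tok = self.next()
        if tok == "(":
            value = self.expr()  # recursion back to the top rule
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok == "-":
            return -self.factor()
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

Turning this into a compiler would mean having each method emit code instead of computing `value` directly; that step is not shown here.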

tom_mellior · on July 16, 2019

> a very early introduction to thinking recursively

Not everything needs to be on a so-you-have-never-programmed-before level. This series explicitly assumes "some knowledge of native-code build processes, Lisp, C, and x86 assembly language". People who know Lisp should have had an introduction to thinking recursively already.

Parsing would be a useless distraction for someone interested in writing a Scheme compiler in Scheme.

nzentzis · on July 16, 2019

I think I partially agree, but it mainly depends on the audience. I've found that parsing and recursion appear in more areas than just compilers, to the point where it'd be relatively difficult to avoid them. Machine code generation and the underlying implementation of high-level primitives (closures, tail-call recursion, and the like) haven't, at least in my experience, been "naturally-occurring" to the same extent.

In terms of creating a "learn compilers from scratch" resource, Crenshaw's approach is definitely better. The trade-off would be that it'd take longer to get past the "writing a recursive descent/LR parser" phase, and it might never get to higher-level language features at all depending on the input language you go with.

phouchg · on July 17, 2019

I've worked my way through quite a few tutorials on implementing Scheme, and I think you're absolutely correct that detailed material on implementing high-level features, especially in machine code, is lacking. I'm very grateful for this tutorial. Can't wait for the next part!

faitswulff · on July 16, 2019

Haven't started reading yet, but I'm really digging this trend of syntax-highlighted bare text blogs[0]. Is this a template or is it hand-crafted?

[0]: https://christine.website/blog/h-language-2019-06-30

nzentzis · on July 16, 2019

Currently I'm using a modified version of the after-dark[0] theme for Zola.

[0] https://www.getzola.org/themes/after-dark/

chubot · on July 16, 2019

Not related to this site, but my site is "hand-crafted":

http://www.oilshell.org/site.html

http://www.oilshell.org/blog/

And yes one of the main things I had to integrate myself was syntax highlighting (via pygments).

FWIW, I think proportional fonts for text and fixed width for code is more readable.

A few people asked me about the tools, and I dumped them here (but they are not supported, may not be runnable):

https://github.com/oilshell/blog-code/tree/master/tools-snap...

kitd · on July 16, 2019

I do like the simplicity, but the lack of proportional fonts for the text makes reading much more effort than it should be. There's a reason printers have used proportion and serifs for centuries.

vednoc · on July 16, 2019

The website seems to be using a template [0], and the CSS is a framework [1].

[0]: https://github.com/getzola/after-dark

[1]: https://github.com/egoist/hack

nathell · on July 16, 2019

I have attempted following Ghuloum's paper, too. One difference is I wanted to make it as self-contained as possible and didn't want to depend on a C compiler or binutils. So I wrote a simple assembler. Here it is, all in Clojure:

https://github.com/nathell/lithium

It's dormant – I was stuck on implementing environments around step 7 of 24 – but someday I will return to it and make progress.

pjmlp · on July 16, 2019

When I went over the tiger book back at the university, our teacher had a cool approach to overcome that.

Generate bytecode, but in a form that could be easily mapped to macros on a Macro Assembler, thus we only needed to write such macros for each target platform.

From a performance point of view it was quite bad, but we got complete AOT static binaries out of it anyway.
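A rough sketch of that idea, with the macro assembler simulated by a Python template table rather than a real assembler: each bytecode operation expands to a fixed sequence of target instructions, so porting means writing one new table. The opcode names (`PUSHI`, `ADD`, `MUL`) and the 32-bit x86 expansions are assumptions for illustration, not the course's actual design.

```python
# Hypothetical "macro" table for 32-bit x86: one entry per bytecode op.
# A second target would only need its own table with the same keys.
X86_MACROS = {
    "PUSHI": ["push dword {0}"],
    "ADD": ["pop ebx", "pop eax", "add eax, ebx", "push eax"],
    "MUL": ["pop ebx", "pop eax", "imul eax, ebx", "push eax"],
}

def expand(bytecode, macros):
    """Expand each (op, *args) tuple into the target's instruction lines."""
    lines = []
    for op, *args in bytecode:
        for template in macros[op]:
            lines.append(template.format(*args))
    return lines

# Stack bytecode for 2 + 3.
program = [("PUSHI", 2), ("PUSHI", 3), ("ADD",)]
assembly = expand(program, X86_MACROS)
```

Every operation goes through the machine stack, which is one plausible reason the resulting code would be slow, as described above.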

Camas · on July 16, 2019

Similar recent thread

"My first 15 compilers" https://news.ycombinator.com/item?id=20408011

norswap · on July 16, 2019

In the same vein, I'm also going to recommend http://www.craftinginterpreters.com/

It's a work in progress, but very well made. I think Bob (who is writing the book) is a great educator.

blondin · on July 16, 2019

it puzzles me sometimes why we programmers are so fascinated by compilers, interpreters, VMs, runtimes, etc. many of these will never make it to the level of, say, a production c++ compiler or a Java VM. and yet we keep building small compilers.

schnitzelstoat · on July 16, 2019

I did the Nand2Tetris course which includes building a basic compiler.

It just helps fully understand how you go from words in a file to actually doing computations and how purely abstract ideas like a 'class' are implemented.

To be fair, I studied Physics and not CS so I didn't have the opportunity to study Compilers at University.

mywittyname · on July 16, 2019

> so I didn't have the opportunity to study Compilers at University.

Lots of CS people haven't either. My university moved compiler theory to the Masters program.

pjmlp · on July 16, 2019

In some countries that is kind of irrelevant because most end up doing masters anyway.

kd5bjo · on July 16, 2019

All craftsmen take an interest in their tools, and in software the tools are made with the same processes we use on a day-to-day basis. It’s the same with blacksmiths and woodworkers to varying degrees.

conrad-mac · on July 16, 2019

Not all those who consider themselves programmers necessarily come from a CS background and so don't learn about concepts like these. To them, walkthroughs like these are fascinating.

willvarfar · on July 16, 2019

Programmers with CS backgrounds also delight in these concepts and things.

setr · on July 17, 2019

Is it not a rite of passage to design & implement your own language, distilling your two years of knowledge and arrogance into one pathetic failure of a design? And from then on appreciating some sense of the difficulty of constructing and maintaining such things?

Perhaps not always languages, but such experiences are vital to our industry!

sn9 · on July 17, 2019

Relevant Steve Yegge blog post: https://steve-yegge.blogspot.com/2007/06/rich-programmer-foo...

acidity · on July 16, 2019

As a side note, is there something similar for building a distributed application (could be a very simple NoSQL DB or maybe some stream processor)?

pjmlp · on July 16, 2019

Props for not being yet another C-based tutorial. Quite an interesting read.

fjfaase · on July 15, 2019

We live in a world where every PC has at least 2 GB of RAM and most CPUs are 64-bit, yet the section 'Data Representation' begins by explaining how 'everything' can be stored in a single 32-bit integer, if we limit integers to 30 bits. What the heck?
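For readers unfamiliar with the technique being criticized, here is a minimal sketch of storing a 30-bit integer in a 32-bit word. It assumes the common scheme where the two low bits are a type tag and `00` marks an integer (a "fixnum"); the article's exact tag values may differ, and all names below are hypothetical.

```python
FIXNUM_SHIFT = 2           # assumption: two low bits hold the type tag
FIXNUM_MASK = 0b11
FIXNUM_TAG = 0b00          # assumption: tag 00 means "integer"
WORD_MASK = 0xFFFFFFFF     # simulate a 32-bit machine word
FIXNUM_MIN = -(1 << 29)
FIXNUM_MAX = (1 << 29) - 1

def encode_fixnum(n):
    """Pack a 30-bit signed integer into a tagged 32-bit word."""
    if not FIXNUM_MIN <= n <= FIXNUM_MAX:
        raise OverflowError("value does not fit in 30 bits")
    return ((n << FIXNUM_SHIFT) | FIXNUM_TAG) & WORD_MASK

def is_fixnum(word):
    return (word & FIXNUM_MASK) == FIXNUM_TAG

def decode_fixnum(word):
    """Recover the integer, sign-extending from 32 bits first."""
    if not is_fixnum(word):
        raise TypeError("word is not tagged as a fixnum")
    if word & 0x80000000:
        word -= 1 << 32
    return word >> FIXNUM_SHIFT

def add_fixnums(a, b):
    """Tagged words can be added directly because the tag bits are zero."""
    return (a + b) & WORD_MASK
```

With this choice of tag, addition needs no untagging step, which is part of why the representation is popular despite costing two bits of range.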

nzentzis · on July 16, 2019

Generating 64-bit code would be simpler in many ways, but I decided not to go into it until the basic language features are put in place. The extent of the changes involved in switching to amd64 will help directly show why having an intermediate representation for the generated code would be valuable.

Besides, if you're looking for a high-performance Scheme that fully utilizes all available system resources, there are definitely better options available. :)

fjfaase · on July 16, 2019

For your initial design you could also have chosen to use an additional byte to represent the type of the value. As representing the type would only require 2 or 3 bits, some bits will be unused (probably some more, due to alignment requirements), but maybe later on in the development of the compiler, those bits could be used to store some additional information. That would have made your code a lot simpler.

As you probably want to combine these values together into some structures representing the various language constructs, an additional byte to represent the type of the structure, and thus the type of its elements, would also be needed. Then you could do away with the extra bits representing the type.

I just think this is premature optimization and makes things unnecessarily complex, especially for your readers who might want to learn something from it.
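A sketch of the alternative described here, assuming a one-byte type field stored next to a full 32-bit payload. The type constants and the `box`/`unbox` helpers are hypothetical, and Python's `struct` module is used only to make the byte layout visible.

```python
import struct

# Hypothetical type codes; only 2 or 3 bits of the byte are actually needed.
TYPE_INT = 0
TYPE_BOOL = 1
TYPE_CHAR = 2

# "<Bi": little-endian, one unsigned type byte, then a signed 32-bit payload.
# The "<" prefix disables alignment padding, so the cell is exactly 5 bytes;
# a real compiler would likely pad this for alignment.
CELL = struct.Struct("<Bi")

def box(type_tag, payload):
    return CELL.pack(type_tag, payload)

def unbox(cell):
    """Return a (type_tag, payload) tuple."""
    return CELL.unpack(cell)
```

The payload keeps its full 32-bit range, at the price of a larger value and a separate load to inspect the type.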

tom_mellior · on July 16, 2019

The types of user-defined "structures" are usually identified by a tag inside the structure, not encoded in the pointer as for the few primitive types.

NikkiA · on July 16, 2019

You'd probably be surprised how many GC'ed languages actually avoid the whole tagged/boxed variable thing in pursuit of the performance benefits. OCaml, for example, is limited to 31-bit ints on 32-bit platforms for this reason, and the Haskell standard only guarantees 30 bits from the 'Int' implementation for this reason.

As I said further down this thread though, it's a 'speed' thing, not a memory storage thing. The common belief is that boxing is slow because it was slow in Java, but in reality boxing is a) a mostly acceptable trade-off that only loses out in extreme cache-limited situations, and b) something that could be optimised away anyway.

tom_mellior · on July 16, 2019

My comment concerned heap-allocated user-defined types, not primitive types like int. Also, these techniques of tagging primitive types predate Java, so whatever convinced people that they are needed, Java wasn't it. (Though things change, so yes, it's a possibility that they are no longer needed. Do you have benchmarks?)

I agree that a lot of boxing can be optimized away, but often it also can't.

wtfrmyinitials · on July 15, 2019

Keeping values small is still useful for the sake of CPU cache

badsectoracula · on July 15, 2019

Yes, RAM is aplenty, let's have every program use it to its fullest :-P

0815test · on July 15, 2019

Also known as "unused RAM is wasted RAM".

abjKT26nO8 · on July 16, 2019

https://cr.yp.to/bib/1995/wirth.pdf

kevin_thibedeau · on July 15, 2019

Most CPUs are 8-bit with 32-bit beginning to take over. They usually don't have more than 256K.

msla · on July 15, 2019

Most CPUs aren't in self-hosting systems.

kevin_thibedeau · on July 16, 2019

Yes they are. Embedded applications outnumber top end computers by an order of magnitude.

nineteen999 · on July 16, 2019

Perhaps he means that they are not generally self-hosted, in that we generally tend to cross-compile for the embedded hardware, rather than compiling on the embedded hardware.

msla · on July 16, 2019

That's precisely what I meant, and I don't understand how it could be misunderstood if you know what "self-hosting" means in this context.

msla · on July 16, 2019

I think you just agreed with me.

NikkiA · on July 16, 2019

It's premature optimisation... Java has conditioned everyone to believe that boxing variables is slow.

pkaye · on July 15, 2019

As Bill Gates was alleged to have said "640k ought to be enough for anybody."

b212 · on July 15, 2019

He did not say that, sorry.

Darkphibre · on July 16, 2019

But he did say we wouldn't likely see speeds in excess of 64kbps. I saved that article for decades and still have it somewhere in a filing cabinet, even though it's been scrubbed from the internet.

He was referencing the limitations of copper POTS, of course. But I still found it funny (if I recall correctly, it was printed side-by-side with his debunking of the 'you only need so much ram' quote).

hermitdev · on July 16, 2019

64kbps was a limitation of old analog phone lines, which were pretty noisy and also operated at a pretty low frequency (they only allowed up to a certain frequency; anything beyond was clipped on old analog lines). In practice, I don't know that I saw anything beyond 56k, and you could only get that on a clean line near a switching station. Myself, I'm not sure I ever saw much beyond 48k, maybe occasionally 52k, but I don't think I ever saw a dial-up connection hit the full 56k.

Higher speeds weren't possible until moving to digital lines (64k for ISDN on a single line, 128k for dual). DSL upped the ante, pushing to 1-3 Mbps in my area. Cable really pushed it when I could get 15 Mbps vs. the best available DSL in my area at around 3 Mbps.

Today, I have around 200 Mbps over cable (gigabit is available); the best DSL in the area is still around 25 Mbps. Fiber is not available yet: AT&T is running fiber, but two years in it's still not an option, and they haven't published pricing or speeds for the area.

Rerarom · on July 16, 2019

My top dial-up speed was 46.6 kbps (not sure if kilobytes or kibibytes)

jazzyjackson · on July 16, 2019

But he was alleged to!