
For the majority of use cases, a simple rule suffices: write a parser that generates LLVM IR and feed that IR into LLVM. You only need to research two topics, parser generators and LLVM IR, and both are pretty simple skills to pick up.
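
To make "parser that emits LLVM IR" concrete, here is a minimal sketch in Python (names and structure are my own; a real frontend would handle much more than integer sums):

```python
# A minimal sketch of the "parser -> LLVM IR" pipeline: a trivial parser
# for integer sums that emits textual LLVM IR for the parsed expression.
import re

def parse_sum(src):
    """Parse 'N + N + ...' into a list of integers."""
    return [int(tok) for tok in re.split(r"\s*\+\s*", src.strip())]

def emit_ir(nums):
    """Emit an LLVM IR function that computes the parsed sum."""
    lines = ["define i32 @expr() {"]
    acc = str(nums[0])
    for i, n in enumerate(nums[1:]):
        reg = f"%t{i}"
        lines.append(f"  {reg} = add i32 {acc}, {n}")
        acc = reg
    lines.append(f"  ret i32 {acc}")
    lines.append("}")
    return "\n".join(lines)

print(emit_ir(parse_sum("1 + 2 + 3")))
```

The emitted text can be handed straight to LLVM's tools (e.g. piped into llc) to get machine code.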

There are a lot of cases where that doesn't work (e.g. writing a functional language, a JIT, or doing programming language research), but those are edge cases. In those cases you probably want to read one of Appel's books ("Modern Compiler Implementation in ML" for example), or the Dragon Book.




100% agree. If starting now, I think you should consider targeting MLIR[1]. It fixes many problems you will have with LLVM, and it allows you to create language-specific dialects and transformations. Impressive new work by a top-notch team led by LLVM creator Chris Lattner. It has LLVM as an optional target, so you get that for free.

For a parser, you should consider using tree-sitter[2]. Tree-sitter gives you live editor support for free. Impressive work by Max Brunsfeld.

[1] https://www.youtube.com/watch?v=qzljG6DKgic [2] https://www.youtube.com/watch?v=a1rC79DHpmY


Here, I would disagree with you in part. There are more than two topics that are needed.

First. Design your language if you are not going to use something that already exists. So you need to understand grammars and the various pitfalls. There are quite a number of tutorials on this subject freely available around the web.
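
As a tiny illustration of one such pitfall, associativity: the shape of the grammar decides what "8 - 3 - 2" means. A toy sketch (my own example, not from any particular tutorial):

```python
# A classic grammar pitfall: a right-recursive rule like
#   expr := num '-' expr
# gives subtraction the wrong meaning, while a left-associative
# loop gives the conventional one.
def parse_right(tokens):
    """Right-recursive grammar: reads 8 - 3 - 2 as 8 - (3 - 2)."""
    head = int(tokens[0])
    if len(tokens) == 1:
        return head
    return head - parse_right(tokens[2:])

def parse_left(tokens):
    """Left-associative loop: reads 8 - 3 - 2 as (8 - 3) - 2."""
    acc = int(tokens[0])
    for i in range(2, len(tokens), 2):
        acc -= int(tokens[i])
    return acc

toks = "8 - 3 - 2".split()
print(parse_right(toks), parse_left(toks))  # 7 3
```

Same token stream, two different answers, purely from how the grammar is written.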

Second. Understand what you want your language to actually mean, this is the subject of semantics. So if you are starting out, make a simple grammar and simple associated semantics. Again there is various material around the traps that you can avail yourself of.
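
One lightweight way to pin semantics down while experimenting is to write them as a tree-walking evaluator: the evaluator *is* the meaning of your language. A toy sketch (the tuple AST encoding is my own):

```python
# Semantics as an evaluator: the same AST gets its meaning from
# this small tree-walking interpreter.
def evaluate(node, env):
    """Evaluate tuple-encoded ASTs like ('add', ('var', 'x'), ('num', 2))."""
    op = node[0]
    if op == "num":
        return node[1]
    if op == "var":
        return env[node[1]]   # a semantic choice: unbound names raise KeyError
    if op == "add":
        return evaluate(node[1], env) + evaluate(node[2], env)
    if op == "mul":
        return evaluate(node[1], env) * evaluate(node[2], env)
    raise ValueError(f"unknown node {op!r}")

ast = ("add", ("var", "x"), ("mul", ("num", 2), ("num", 3)))
print(evaluate(ast, {"x": 4}))  # 10
```

Even decisions as small as "what happens on an unbound variable" are semantics, and writing the evaluator forces you to make them explicitly.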

Third. Have a look at the various parser generators out there and what they expect in terms of your language. There are many and each has its pros and cons. Again, there are various tutorials about the subject.

Fourth. If your language is relatively simple and has relatively simple semantics, see what is required for the generation of LLVM IR. If there is any complexity in the language and its semantics, LLVM IR may not necessarily be the way to go. So, back to the first and second points, keep your investigatory language simple in both its syntax (grammar) and its semantics.

Fifth. Have fun. The entire subject can be a delight to learn, but some of the material out there is deadly boring and will quickly turn most people off the subject. But it is actually fun. So enjoy the time you spend learning about it.


These books are kind of old. Are they still state of the art or is there something newer?

I had a compiler class at university but that was a while ago, so I would like to update my knowledge because there seems to be a lot of exciting stuff going on with compilers and transpilers.


As others have pointed out, a lot of compilers have basically stayed the same in the past 25 years. What's changed has been some new (generally special-purpose) minor optimizations, and the growth of compilers to consider less traditional domains.

The newer things:

* SLP vectorization. Instead of classic vectorization (where you have a loop that you convert to a vector), SLP vectorization tries to form vectors from a single basic block, and it can work quite well for the small SIMD units such as SSE.

* Polyhedral loop transformation. This is sort of the equivalent of the lattice-based dataflow analysis methodology, which is to say it's a very powerful, and slow, general-purpose technique which actually isn't used all that much in production compilers.

* Decompilation, disassembly, and other forms of static binary analysis have progressed a fair amount in the past decade or so.

* Dynamic binary analysis/translation, of which the state of the art is probably Intel Pin (https://dl.acm.org/citation.cfm?id=1065034).

* JITs have evolved a lot. In terms of what's missing from the classic compiler books, this is the big missing area. Things such as tracing, recompilation techniques, garbage collection, guard techniques, etc. have come a long way, and I'm not off-hand aware of any actual good book or paper here that covers most of the topic.

* Superoptimization is a technique that's still mostly in the academic phase, but you are seeing peephole optimizations and other information populated by ahead-of-time superoptimization (e.g., Regehr's Souper work).

* Symbolic and concolic execution are also in the transition phase between academic and industry work. The most advanced use of concolic execution is in fuzzing, where it is used to guide the fuzzer to generate specific inputs that exercise as many paths as possible.
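
The superoptimization bullet above can be conveyed with a toy brute-force search (my own sketch; real systems like Regehr's Souper drive the search with an SMT solver rather than blind enumeration):

```python
# Toy superoptimizer: exhaustively search for the shortest straight-line
# program over a tiny ISA that matches a reference function on test inputs.
from itertools import product

OPS = {
    "add_self": lambda x: x + x,
    "shl1":     lambda x: x << 1,
    "inc":      lambda x: x + 1,
    "neg":      lambda x: -x,
}

def superoptimize(reference, tests, max_len=2):
    """Return the shortest op sequence agreeing with `reference` on `tests`."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            def run(x, seq=seq):
                for op in seq:
                    x = OPS[op](x)
                return x
            if all(run(t) == reference(t) for t in tests):
                return seq
    return None

# Find a two-instruction program computing 2*x + 1.
print(superoptimize(lambda x: 2 * x + 1, tests=[0, 1, 5, -3]))
# -> ('add_self', 'inc')
```

The testing-on-inputs shortcut is unsound on its own; production work verifies the winning candidate for all inputs before installing it as a peephole rule.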

Most of these topics wouldn't be covered in something as introductory as the Dragon Book (which itself has poor coverage of the major loop transformations), and generally only come into play in more specific scenarios.
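
To give a flavor of the symbolic-execution idea as well, a deliberately tiny sketch (no real solver here; a brute-force search over small integers stands in for something like an SMT solver):

```python
# Tiny symbolic-execution sketch: enumerate the branch outcomes of a toy
# program as path conditions, then find one concrete input per path.
def paths():
    """Path conditions for: if x > 10: (if x % 2 == 0: BUG else: A) else: B."""
    return {
        "BUG": [lambda x: x > 10, lambda x: x % 2 == 0],
        "A":   [lambda x: x > 10, lambda x: x % 2 != 0],
        "B":   [lambda x: x <= 10],
    }

def solve(conds, domain=range(-50, 51)):
    """Stand-in constraint solver: try every input in a small domain."""
    for x in domain:
        if all(c(x) for c in conds):
            return x
    return None

inputs = {label: solve(conds) for label, conds in paths().items()}
print(inputs)  # one concrete input per feasible path, e.g. the BUG path
```

This is exactly the shape of the fuzzing application: each discovered input drives execution down a different path, including the buggy one.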


The basics don't change. There was this rather fun talk on HN 2 weeks ago https://news.ycombinator.com/item?id=19657983 where he asserts that only 8 optimizations are necessary to get about 80% of best case performance, and all of those were catalogued in 1971.
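
For a taste of those classic transformations, here is a toy sketch (my own IR encoding, not taken from the talk) of two of them, constant folding followed by dead-code elimination:

```python
# Toy three-address IR: instructions are (dst, op, a, b) tuples.
# Two classic optimizations applied as simple list rewrites.
def const_fold(block):
    """Replace adds of two literal operands with a constant."""
    out = []
    for dst, op, a, b in block:
        if op == "add" and isinstance(a, int) and isinstance(b, int):
            out.append((dst, "const", a + b, None))
        else:
            out.append((dst, op, a, b))
    return out

def dce(block, live_out):
    """Walk backwards, dropping instructions whose result is never read."""
    live, out = set(live_out), []
    for dst, op, a, b in reversed(block):
        if dst in live:
            out.append((dst, op, a, b))
            live.discard(dst)
            live.update(v for v in (a, b) if isinstance(v, str))
    return list(reversed(out))

block = [("t1", "add", 2, 3), ("t2", "add", "t1", 1), ("t3", "add", 40, 2)]
print(dce(const_fold(block), live_out={"t2"}))
# t1 is folded to const 5; t3 is dead and disappears entirely.
```

Both passes are from the classic catalogue; most of the remaining 80% in that talk's claim are similarly small rewrites applied systematically.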


Wow, that's a really interesting fact. I hope you don't mind but I updated the article to reference this and credited you accordingly :^)


While some projects within the compiler toolchain reinvent the wheel to varying degrees of increased performance/functionality (e.g. MIR, probably some parsing tools too, the gold linker, lldb, etc.), the majority of projects these days take advantage of existing tooling, applying frontends and backends in different ways. Want a JavaScript-to-LLVM compiler? Map the JavaScript AST, produced by an existing JavaScript parser, to LLVM IR and have LLVM generate code. The combinations are infinite and high quality. Most people (myself included) only experiment at the highest level.

All this is to say the barrier to entry in PL/compiler/interpreter implementation is only as high as you want it to be. You can do the whole thing yourself as a learning experience, or you can use existing tools to finish a fairly full-featured project in a few hours spread over a few weeks.



