Fabrice Bellard (who also had some spare time to create ffmpeg, QEMU, and a few others like a software implementation of a 4G-LTE base station) wrote tcc, the tiny C compiler: https://bellard.org/tcc/
Small, fast, and of course self-compiling! The project grew out of an earlier one named OTCC, which won the IOCCC (yes, the obfuscated C programming contest) in less than 2k bytes. https://bellard.org/otcc/
And from it grew tccboot (https://bellard.org/tcc/tccboot.html), a boot loader which compiles the Linux kernel on the fly from source then runs it, the entire process taking only a few seconds ("15 seconds on a 2.4 GHz Pentium 4.")
What are some current best practices for writing (C) compilers that were not common when, say, gcc 1.0 was written? Would someone writing a basic C compiler today approach it differently than Stallman did with gcc 1.0?
IMO one big advance would be writing the compiler in something other than C. C is a bad language for writing compilers in (including C compilers)!
Even (cough) C++ is better than C for this use case, e.g. comparing LLVM vs. GCC. But C++ has problems too -- the TableGen language in LLVM is an interesting example of why a big C++ project like LLVM/Clang doesn't want all its logic/specification in C++. They really use C++ to the hilt and they still want/need a DSL.
Other interesting options would be Swift, Rust, OCaml, or Go. Swift, Rust, and OCaml all have algebraic data types.
I guess choosing something other than C or C++ is a little unusual because there is the tendency to self-host. The Rust, OCaml, and Go compilers are written in Rust, OCaml, and Go, respectively. I'm pretty sure Swift is a lot of C++, but that will probably change as time goes on.
I think C is fine for a simple compiler. Why are algebraic data types so relevant? Pattern matching is nice; switch statements plus manual unpacking are more verbose.
For big optimizing compilers you appreciate the abstractions C++ enables: for example, a data-flow analysis framework.
Algebraic data types are just so perfectly suited to abstract syntax. It's clearly possible to model an AST in C, but not to do it in an ergonomic, typesafe way.
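To make that concrete, here is a minimal sketch (all names invented, not from any compiler mentioned here) of an expression AST as a tagged union in C. It works, but every construction and every "match" is manual, and nothing ties the tag to the union arm, so reading the wrong arm compiles fine:

```c
#include <stdlib.h>

/* A tiny expression AST as a tagged union -- the closest C gets to
   an algebraic data type. */
typedef enum { EXPR_NUM, EXPR_ADD, EXPR_MUL } ExprKind;

typedef struct Expr {
    ExprKind kind;
    union {
        int num;                               /* EXPR_NUM */
        struct { struct Expr *lhs, *rhs; } op; /* EXPR_ADD / EXPR_MUL */
    } u;
} Expr;

static Expr *mk_num(int n) {
    Expr *e = malloc(sizeof *e);
    e->kind = EXPR_NUM;
    e->u.num = n;
    return e;
}

static Expr *mk_op(ExprKind k, Expr *l, Expr *r) {
    Expr *e = malloc(sizeof *e);
    e->kind = k;
    e->u.op.lhs = l;
    e->u.op.rhs = r;
    return e;
}

static int eval(const Expr *e) {
    /* Each "pattern match" is a switch on the tag plus manual field
       access; an ADT language would check exhaustiveness for us. */
    switch (e->kind) {
    case EXPR_NUM: return e->u.num;
    case EXPR_ADD: return eval(e->u.op.lhs) + eval(e->u.op.rhs);
    case EXPR_MUL: return eval(e->u.op.lhs) * eval(e->u.op.rhs);
    }
    return 0; /* appease the compiler; an exhaustive match wouldn't need this */
}
```

In OCaml or Rust the whole thing would be a handful of lines, with the compiler checking that every constructor is handled.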
Depends if you mean a hobby/small C compiler or if you were trying to write something of similar scope to gcc or Watcom, say. If the latter, there have been big developments in the optimization stages (not least due to CPU and architecture changes over the years) and the approach to targeting runtimes would be quite different (far less likely to target x86 directly nowadays when there are things like LLVM or even the CLR/JVM/similar).
At the risk of sparking controversy, I think Go is a great example of "how would they do it now?" which goes further than compiler techniques and back to rethinking what the runtime should even be.
> I think Go is a great example of "how would they do it now?" which goes further than compiler techniques and back to rethinking what the runtime should even be.
Agreed, but I wish they released the runtime as a separate component, in the same spirit as LLVM.
I re-used the Digital Mars C++ optimizer and back end for the dmd D compiler, so that would be like using LLVM for an optimizer and back end.
I still wind up spending nearly all my time on the D front end, and not the lexer/parser. The time spent on the lexer/parser is a rounding error.
Besides, build the lexer/parser by hand. The mystery of how to do them will evaporate, and you'll find you can apply that knowledge to all sorts of other tasks.
Lexers are easier to write by hand than parsers are, but lexer generators such as flex are still very useful. The thing they are most useful for is that their algorithms always find the longest match possible, which is something that can be easy to get wrong by hand.
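To illustrate the longest-match point, here is a hand-written fragment (token names invented) for the classic `=` vs `==` case. Forgetting the one-character lookahead is exactly the kind of bug that flex's maximal-munch rule rules out automatically:

```c
/* Sketch of longest-match (maximal munch) lexing by hand.
   Given "==", we must return one TOK_EQ, not two TOK_ASSIGNs,
   so the lexer has to peek one character ahead. */
typedef enum { TOK_ASSIGN, TOK_EQ, TOK_LT, TOK_LE, TOK_EOF } Tok;

static Tok next_token(const char **p) {
    switch (**p) {
    case '=':
        (*p)++;
        if (**p == '=') { (*p)++; return TOK_EQ; }  /* longest match wins */
        return TOK_ASSIGN;
    case '<':
        (*p)++;
        if (**p == '=') { (*p)++; return TOK_LE; }
        return TOK_LT;
    case '\0':
        return TOK_EOF;
    default:
        (*p)++;                /* skip anything else in this sketch */
        return TOK_EOF;
    }
}
```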
Regarding LALR parsers, the best thing about them is that they error out if they think your grammar might be ambiguous, which is especially useful if you are designing your own language. It is also easier to evolve a grammar if you build the parser with a high-level tool than if you code everything by hand. For example, a top-down parser needs to know at every point which lookahead tokens are valid, and adding a new rule to the grammar might require changing the lookahead sets in many different spots.
The place where yacc doesn't shine is if your grammar isn't context free, in which case you might need to use hacky tricks or advanced parsing techniques. Bottom-up parser generators also aren't the best at generating informative error messages (but they do a decent job of producing a bare-minimum default error message).
Thank you very much - that's all useful - especially the bit about bottom up parser generators being not the best at generating informative error messages.
If you haven't used flex & bison and you're writing your first compiler, you need to learn them and still take care of the things they don't do for you out of the box (they don't know C and the context of the input).
IOW, it looks like the hardest work may still be up to you, but now you're dependent on external tools and self-hosting has become a bit questionable/problematic (you'd want those tools to be compiled by your compiler as well, right?).
Parsing tokens, OTOH, isn't a big deal. You don't need heavy artillery for them.
IOW, if you haven't had lots of experience with those tools or compiler creation already, you may not gain much from using them for the first time. If you have, you may reuse your experience and possibly even code.
Thank you very much for your reply. That's helpful and fits in well with my experience so far.
I asked the question in the context where self-hosting was required, but I probably don't need it.
On the other hand, more dependencies do complicate things. So I was interested to know whether Flex & Bison (or equivalent tools in other languages, e.g. Alex & Happy for Haskell) were worth the extra complexity.
I started from one of the Appel books that I had on my book shelf for years: Modern Compiler Implementation in ML.
Reading Appel, he writes "The task of constructing LR(1) or LALR(1) grammars is simple enough to be automated. And is so tedious to do by hand that LR parsing for realistic grammars is rarely done except using parser-generator tools." p68. (Agreeing with what you advise, even Appel admits that the lexing of tokens isn't a big deal.)
But after that the only C compiler (for example) that I could find that used Bison (or Yacc) was the Portable C Compiler, so I was starting to suspect that Appel might be taking a rather academic view.
That's likely because most C compilers (in my experience) use top-down parsers (LL) rather than bottom-up ones (LR).
In contrast to LR parsers, LL parsers are quite easy to write and maintain by hand (recursive descent, parser combinators, etc.), and the nature of such hand-written parsers makes it comparatively easy to step outside theoretical bounds and implement context-sensitive portions of the grammar. Such parsers have some other benefits, too, like being easier to debug and making it easier to provide useful error messages (but see Menhir for OCaml, for example, which offers some improvements in these areas over traditional LR parser generators).
What Appel has said is true—it's rare to encounter hand-written LR(1) or LALR(1) parsers in practice. However, many practical languages are easy enough to parse with hand-written top-down parsers that it's quite common to encounter hand-written LL(1)-ish parsers in practice.
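As a sketch of why hand-written LL parsing stays manageable, here is a recursive-descent evaluator for a toy expression grammar (invented for illustration, not from any compiler mentioned here). One function per nonterminal, with precedence encoded in the grammar levels, is what makes this style easy to read, debug, and extend by hand:

```c
#include <ctype.h>

/* Recursive-descent (LL) parser for the toy grammar:
       expr   -> term (('+'|'-') term)*
       term   -> factor (('*'|'/') factor)*
       factor -> NUMBER | '(' expr ')'
   It evaluates as it parses; error handling is omitted in this sketch. */
static const char *src;

static int parse_expr(void);

static int parse_factor(void) {
    if (*src == '(') {
        src++;                     /* consume '(' */
        int v = parse_expr();
        src++;                     /* consume ')' */
        return v;
    }
    int v = 0;
    while (isdigit((unsigned char)*src))
        v = v * 10 + (*src++ - '0');
    return v;
}

static int parse_term(void) {
    int v = parse_factor();
    while (*src == '*' || *src == '/') {
        char op = *src++;
        int r = parse_factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int parse_expr(void) {
    int v = parse_term();
    while (*src == '+' || *src == '-') {
        char op = *src++;
        int r = parse_term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}
```

Stepping outside the context-free bounds (say, the typedef-name ambiguity in C) is just a matter of consulting whatever state you like inside these functions, which is awkward to express in a pure LR grammar.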
I had been wondering how one was meant to provide meaningful error messages for code that failed to lex or parse using the LR approach, and it looks like the answer may be "with difficulty", which is definitely a downer.
Philippe Charles's PhD thesis[0] describes a method for generating good error messages with an LALR parser generator. lpg[1] implements these techniques.
I don't think it's ever been the case that C front ends used lex/yacc. Whether it's "best practice" I suppose depends on who you ask! (You'll certainly learn it with parser generators in university because writing a hand-written parser takes too long for an undergrad class, and they will only cover a tiny subset of C if anything.)
You can see here that the original C compilers use a hand-written expression parser using the Shunting Yard algorithm, and I'm sure the rest of the front end is hand-written too:
(2) The preprocessor. While it doesn't prevent you from using lex/yacc for the C parser, this is an entirely separate lexer and parser that is always written by hand AFAICT.
Awk, on the other hand, is almost always parsed with lex/yacc. The language was almost designed around those tools.
I think if you're writing say your first compiler you should try to use lex and yacc. I think you can solve the context-sensitivity problem but I haven't done it myself. Operator precedence probably requires some extensions or maybe you can encode the many levels of C precedence in the grammar.
One thing I found from implementing a shell parser [1] is that the POSIX spec uses yacc grammars, even if the languages aren't best parsed with bottom-up parsing. For example, all shells except bash use top-down parsing.
Analogously, the C grammar may "manually" encode the operator precedence, even though the original C parsers and current ones like GCC/Clang actually use an operator precedence algorithm like the Shunting Yard algorithm, not yacc's shift-reduce algorithm.
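For comparison, here is a sketch of precedence climbing, a recursive cousin of Shunting Yard: instead of one grammar rule (or one function) per precedence level, a single loop consults a table of binding powers. All names and the two-level operator table are invented for illustration:

```c
#include <ctype.h>

/* Binding power for each binary operator; 0 means "not an operator". */
static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default: return 0;
    }
}

static const char *cur;

static int parse_primary(void) {
    int v = 0;
    while (isdigit((unsigned char)*cur))
        v = v * 10 + (*cur++ - '0');
    return v;
}

/* Parse operators whose precedence is at least min_prec.
   Call with min_prec = 1 for a full expression. */
static int parse_binary(int min_prec) {
    int lhs = parse_primary();
    while (prec(*cur) >= min_prec) {
        char op = *cur++;
        /* Left-associative: the right operand binds one level tighter. */
        int rhs = parse_binary(prec(op) + 1);
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
    return lhs;
}
```

With C's fifteen-odd precedence levels, one loop plus a table is a lot less code than fifteen grammar rules, which is presumably part of why hand-written front ends favor this approach.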
Thank you very much for all that information - much appreciated!
The Portable C Compiler was written using Yacc. [1]
>I think if you're writing say your first compiler you should try to use lex and yacc.
I've actually started with something higher level still: BNFC [2], which has taken me quite a long way quite quickly. It generates lexer, parser, operator precedence and AST along the lines of the Appel books for a variety of backends: C, Java, OCAML, Haskell etc.
But now I'm running into bugs and limitations in BNFC. I may fix some of those if I can, but at some point I will probably dump BNFC and then hand-edit the generated Flex/Bison.
Maybe that's what people do eventually - start with a high-level tool and then move lower?
OK BNFC is interesting! I think that kind of tool is the holy grail, but they all have limitations. You kind of have to design your language AROUND it, and C obviously doesn't fit in that category.
For example, I mentioned that Awk was designed with yacc.
It depends how ambitious you want to be. You can tweak C to be more easily parsed, but then you will parse fewer real programs.
It's not surprising to me at all that BNFC will have problems with C. So I guess I would say you should use high level tools like BNFC / lex / yacc until you can't.
FWIW here is another little problem you will run into with C:
I guess the point is that parser generators and even grammars themselves are leaky abstractions! To effectively use parsing tools you have to be familiar with their internals and the underlying algorithm. If there are two levels of code generation like BNFC, then that makes this task harder!
I think it makes sense to start high and go low. It also makes sense to start low-level and go high. Once you are sick of writing parsers by hand, you will understand how to write a better parser generator than the ones that currently exist :)
Nice. This is a really ambitious project for just 4 contributors (one of whom looks like they did most of the work). I'm curious why ASM and not LLVM?
Edit (comment above unaltered):
I got downvoted by someone, which tells me I may have missed an obvious answer to my question. Did I miss something?
I've noticed a lot of compiler projects lately tend to default to LLVM, I'm curious of the strengths and weaknesses of LLVM vs ASM since the author here chose to compile to ASM.
My impression (as someone working on a compiler of my own, using LLVM), is that while you get a lot out of building on LLVM (portability across platforms, a collection of "free" optimizations, and a well defined low-level IR), it comes at the cost of a huge toolchain (takes about 5-10 minutes to compile from scratch on my laptop), and it is not stable across releases (the extent to which you are affected depends on the features your compiler depends on though). I think for toy projects it is probably more interesting to write the code gen yourself; and for projects focused on a specific platform, it is perhaps desirable to handle it yourself. Some people probably just don't want the dependency. My perspective is that I don't want to reinvent the work on cross platform support, or the optimization passes, and would rather take advantage of those efforts by depending on LLVM. I don't know that this makes sense for this project though.
LLVM is big, heavy, complicated, powerful. Kind of against the spirit of writing a lightweight C compiler. On the other hand, an x86 assembler can be easily implemented by one person and requires minimal cpu and memory resources.
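As a rough illustration of how little machinery a direct-to-assembly backend needs, here is a toy emitter (function name and output shape invented) that produces NASM-style text for a function returning a constant:

```c
#include <stdio.h>

/* Toy "backend": emit NASM-style x86 for a function that just returns
   a constant. A real code generator walks the AST and emits a line or
   two per node, but the output side never gets much heavier than this. */
static int emit_return_n(char *buf, size_t cap, int n) {
    return snprintf(buf, cap,
                    "global main\n"
                    "main:\n"
                    "    mov eax, %d\n"
                    "    ret\n",
                    n);
}
```

The resulting text can be handed straight to NASM; there is no IR, no pass manager, and no multi-minute toolchain build in the loop.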
> Currently it generates 16-bit and 32-bit 80386+ assembly code for NASM that can then be assembled and linked into DOS, Windows and Linux programs.
I wonder why not 64-bit? Hardly anyone uses 32-bit processors anymore, and even in the Windows world 64-bit systems are slowly taking over, so it would seem more logical to me to target modern processors instead of 16-bit architectures.
That is so, but the subset of x86_64 that you'd use for code generation is pretty much the same as the subset of x86 that you'd use for code generation.
- additional 64-bit types will need additional code for type checks and conversions, switch, etc
- new or improved code generator is needed (ideally, 64-bit non-pointer types should also be supported in 32-bit programs, possibly as register pairs)
- new object and executable formats to handle
- possibly new ABIs to support (pushing arguments onto the stack and never bothering to align the stack pointer to an 8- or 16-byte boundary is kinda easy, pretty much free)
I don't see any reason the preprocessor couldn't be integrated with the main parsing pass. It's all top-down (things defined before they're used) just like C.
"versions of GCC prior to 4.8 also allow bootstrapping with a ISO C89 compiler and versions of GCC prior to 3.4 also allow bootstrapping with a traditional (K&R) C compiler."
So you might need to build tcc -> gcc 3.3 -> gcc 4.7 -> gcc 7.2