SmallerC – Small, Simple, Self-Compiling, Single Pass C Compiler (github.com/alexfru)
155 points by peter_d_sherman on Oct 13, 2017 | 59 comments



Fabrice Bellard (who also had some spare time to create ffmpeg, QEMU, and a few others like a software implementation of a 4G-LTE base station) wrote tcc, the tiny C compiler: https://bellard.org/tcc/

Small, fast and of course self-compiling! tcc grew out of an earlier project named OTCC, which won the IOCCC (yes, the International Obfuscated C Code Contest; the whole compiler fit in less than 2K bytes). https://bellard.org/otcc/


And from it grew tccboot (https://bellard.org/tcc/tccboot.html), a boot loader which compiles the Linux kernel on the fly from source then runs it, the entire process taking only a few seconds ("15 seconds on a 2.4 GHz Pentium 4.")


About 3 seconds on my i5 from 2015.


One of the most amazing JavaScript apps I've ever seen was also written by Bellard: jslinux

https://bellard.org/jslinux/


I had no idea that the dude who made QEMU also made ffmpeg. Those are two of my favorite apps...I should buy that guy a pizza.


yes he is an amazing guy, he should put a "buy me a pizza" link on his homepage


His newest is an RV128IMAFDQC (RISC-V) emulator: https://bellard.org/riscvemu/


Yeah, Fabrice is a monster.


What are some current best practices for writing (C) compilers that were not common when, say, gcc 1.0 was written? Would someone writing a basic C compiler today approach it differently than Stallman did with gcc 1.0?


IMO one big advance would be writing the compiler in something other than C. C is a bad language for writing compilers in (including C compilers)!

Even (cough) C++ is better than C for this use case, e.g. comparing LLVM vs. GCC. But C++ has problems too -- the TableGen language in LLVM is an interesting example of why a big C++ project like LLVM/Clang doesn't want all its logic/specification in C++. They use C++ to the hilt and they still want/need a DSL.

Other interesting options would be Swift, Rust, OCaml, or Go. Swift, Rust, and OCaml all have algebraic data types.

I guess choosing something other than C or C++ is a little unusual because there is the tendency to self-host. The Rust, OCaml, and Go compilers are written in Rust, OCaml, and Go, respectively. I'm pretty sure Swift is a lot of C++, but that will probably change as time goes on.


I think C is fine for a simple compiler. Why are algebraic data types so relevant? Pattern matching is nice, because switch statements plus manual unpacking are more verbose.

For big optimizing compilers you appreciate the abstractions C++ enables, for example a data-flow analysis framework.


Algebraic data types are just so perfectly suited to abstract syntax. It's clearly possible to model an AST in C, but not to do it in an ergonomic and typesafe way.
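For concreteness, here's roughly what the usual C encoding (a tagged union) looks like. This is an illustrative sketch with made-up node names, not code from any real compiler:

```c
#include <stdlib.h>

/* A tagged union is the standard C stand-in for an algebraic data type. */
typedef enum { EXPR_NUM, EXPR_ADD, EXPR_MUL } ExprKind;

typedef struct Expr {
    ExprKind kind;
    union {
        int num;                               /* EXPR_NUM           */
        struct { struct Expr *l, *r; } bin;    /* EXPR_ADD, EXPR_MUL */
    } u;
} Expr;

Expr *expr_num(int n) {
    Expr *e = malloc(sizeof *e);
    e->kind = EXPR_NUM;
    e->u.num = n;
    return e;
}

Expr *expr_bin(ExprKind k, Expr *l, Expr *r) {
    Expr *e = malloc(sizeof *e);
    e->kind = k;
    e->u.bin.l = l;
    e->u.bin.r = r;
    return e;
}

/* The switch is C's approximation of pattern matching: nothing forces us
   to handle every ExprKind, which is exactly the safety that ML-family
   languages add on top. */
int eval(const Expr *e) {
    switch (e->kind) {
    case EXPR_NUM: return e->u.num;
    case EXPR_ADD: return eval(e->u.bin.l) + eval(e->u.bin.r);
    case EXPR_MUL: return eval(e->u.bin.l) * eval(e->u.bin.r);
    }
    return 0; /* unreachable, but the compiler can't prove that */
}
```

Nothing in the type system ties `e->kind` to the union member you access; that coupling is maintained purely by convention, which is the ergonomics/typesafety gap being described.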


In addition to being more succinct, algebraic data types are also better at warning you if you forget to handle one of the cases.


One cool benefit of using C for a C compiler though is the ability to self host the compiler.


Depends if you mean a hobby/small C compiler or if you were trying to write something of similar scope to gcc or Watcom, say. If the latter, there have been big developments in the optimization stages (not least due to CPU and architecture changes over the years) and the approach to targeting runtimes would be quite different (far less likely to target x86 directly nowadays when there are things like LLVM or even the CLR/JVM/similar).

At the risk of sparking controversy, I think Go is a great example of "how would they do it now?" which goes further than compiler techniques and back to rethinking what the runtime should even be.


> I think Go is a great example of "how would they do it now?" which goes further than compiler techniques and back to rethinking what the runtime should even be.

Agreed, but I wish they released the runtime as a separate component, in the same spirit as LLVM.


Supplementary to that, and a bit more specific, I'd be grateful for an answer to the following:

Is it considered best practice today to use Flex & Bison for the front-end of (for example) a basic C compiler?


Lexing and parsing are by far the simplest parts of a compiler. (That's why it is possible for Flex & Bison to exist.)

99% of the time spent writing a compiler will be elsewhere.

If you want a correct-to-the-last-drop C preprocessor, prepare to spend 6 months on that, minimum. Or just use Warp:

https://github.com/facebookarchive/warp

and then you can have fun just working on the compiler bit.


Thank you - that's informative.

>99% of the time spent writing a compiler will be elsewhere.

Out of interest, has the distribution of labor in that remaining 99% changed at all since you first started out?

For example, do things like LLVM change anything?


I re-used the Digital Mars C++ optimizer and back end for the dmd D compiler, so that would be like using LLVM for an optimizer and back end.

I still wind up spending nearly all my time on the D front end, and not the lexer/parser. The time spent on the lexer/parser is a rounding error.

Besides, build the lexer/parser by hand. The mystery of how to do them will evaporate, and you'll find you can apply that knowledge to all sorts of other tasks.


Lexers are easier to write by hand than parsers are, but lexer generators such as flex are still very useful. What they are most useful for is that their algorithms always take the longest match possible, which is easy to get wrong by hand.

Regarding LALR parsers, the best thing about them is that they error out if they think your grammar might be ambiguous, which is especially useful if you are designing your own language. It is also easier to evolve a grammar if you build the parser with a high-level tool than if you code everything by hand. For example, a top-down parser needs to know at every point which lookahead tokens are valid, and adding a new rule to the grammar might require changing the lookahead tokens in many different spots.

The place where yacc doesn't shine is if your grammar isn't context free, in which case you might need to use hacky tricks or advanced parsing techniques. Bottom-up parser generators also aren't the best at generating informative error messages (but they do a decent job of producing a bare-minimum default error message).
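The longest-match ("maximal munch") rule mentioned above can be sketched by hand in C. The token names and function are hypothetical; the point is that the longest operators must be tried first, which a flex-generated DFA gets right automatically:

```c
#include <string.h>

/* Hypothetical token codes for C's '>'-family of operators. */
typedef enum { TOK_GT, TOK_GE, TOK_SHR, TOK_SHR_ASSIGN } Tok;

/* Maximal munch: try the longest operator first. Checking ">" before
   ">=" would wrongly split ">=" into two tokens -- the easy-to-get-wrong
   part when lexing by hand. */
Tok lex_gt(const char *p, int *len) {
    if (strncmp(p, ">>=", 3) == 0) { *len = 3; return TOK_SHR_ASSIGN; }
    if (strncmp(p, ">>", 2)  == 0) { *len = 2; return TOK_SHR; }
    if (strncmp(p, ">=", 2)  == 0) { *len = 2; return TOK_GE; }
    *len = 1;
    return TOK_GT;
}
```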


Thank you very much - that's all useful - especially the bit about bottom up parser generators being not the best at generating informative error messages.


There are tools such as merr that help with this, but generally it's hard to beat hand-crafted error messages developed with love and sweat :)


If you haven't used flex & bison and you're writing your first compiler, you need to learn them and still take care of the things they don't do for you out of the box (they don't know C and the context of the input).

IOW, it looks like the hardest work may still be up to you, but now you're dependent on external tools and self-hosting has become a bit questionable/problematic (you'd want those tools to be compiled by your compiler as well, right?).

Parsing tokens, OTOH, isn't a big deal. You don't need heavy artillery for them.

IOW, if you haven't had lots of experience with those tools or compiler creation already, you may not gain much from using them for the first time. If you have, you may reuse your experience and possibly even code.


Thank you very much for your reply. That's helpful and fits in well with my experience so far.

I asked the question in the context where self-hosting was required, but I probably don't need it.

On the other-hand, more dependencies does complicate things. So I was interested to know whether Flex & Bison (or equivalent tools in other languages e.g. Alex & Happy for Haskell) were worth the extra complexity.

I started from one of the Appel books that I had on my book shelf for years: Modern Compiler Implementation in ML.

Reading Appel, he writes "The task of constructing LR(1) or LALR(1) grammars is simple enough to be automated. And is so tedious to do by hand that LR parsing for realistic grammars is rarely done except using parser-generator tools." p68. (Agreeing with what you advise, even Appel admits that the lexing of tokens isn't a big deal.)

But after that the only C compiler (for example) that I could find that used Bison (or Yacc) was the Portable C Compiler, so I was starting to suspect that Appel might be taking a rather academic view.

[1] https://en.wikipedia.org/wiki/Portable_C_Compiler


That's likely because most C compilers (in my experience) use top-down parsers (LL) rather than bottom-up ones (LR).

In contrast to LR parsers, LL parsers are quite easy to write and maintain by hand (recursive descent, parser combinators, etc.), and the nature of such hand-written parsers makes it comparatively easy to step outside theoretical bounds and implement context-sensitive portions of the grammar. Such parsers have some other benefits, too, like being easier to debug and making it easier to provide useful error messages (but see Menhir for OCaml, for example, which offers some improvements in these areas over traditional LR parser generators).

What Appel has said is true—it's rare to encounter hand-written LR(1) or LALR(1) parsers in practice. However, many practical languages are easy enough to parse with hand-written top-down parsers that it's quite common to encounter hand-written LL(1)-ish parsers in practice.
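A sketch of why recursive descent mirrors the grammar so directly: one function per nonterminal, with the call structure following the productions. This toy recognizer is illustrative only, not from any real compiler:

```c
#include <stdbool.h>

/* Recursive-descent recognizer for the toy grammar
       expr   := term   (('+' | '-') term)*
       term   := factor (('*' | '/') factor)*
       factor := digit  | '(' expr ')'
   Each function corresponds to one nonterminal, which is what makes
   such parsers easy to write, debug, and extend by hand. */
static const char *p;

static bool expr(void); /* forward declaration: the grammar is recursive */

static bool factor(void) {
    if (*p == '(') {
        p++;
        if (!expr() || *p != ')') return false;
        p++;
        return true;
    }
    if (*p >= '0' && *p <= '9') { p++; return true; }
    return false;
}

static bool term(void) {
    if (!factor()) return false;
    while (*p == '*' || *p == '/') { p++; if (!factor()) return false; }
    return true;
}

static bool expr(void) {
    if (!term()) return false;
    while (*p == '+' || *p == '-') { p++; if (!term()) return false; }
    return true;
}

bool parses(const char *s) { p = s; return expr() && *p == '\0'; }
```

Because each decision point is ordinary C code, adding a context-sensitive check (e.g. consulting a symbol table before committing to a production) is just another `if`, with no fight against a generator's formalism.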


Thank you - that's very useful information!

I had been wondering how one was meant to provide meaningful error messages for code that failed to parse or failed to lex using the LR approach, and it looks like maybe the answer is "with difficulty", which is definitely a downer.


Philippe Charles' PhD thesis[0] describes a method for generating good error messages with an LALR parser generator. lpg[1] implements these techniques.

[0] http://jikes.sourceforge.net/documents/thesis.pdf

[1] https://sourceforge.net/projects/lpg/


> were worth the extra complexity

No.


Succinct!

If anyone should know it would be you, so thank you for your verdict.


I don't think it's ever been the case that C front ends used lex/yacc. Whether it's "best practice" I suppose depends on who you ask! (You'll certainly learn it with parser generators in university because writing a hand-written parser takes too long for an undergrad class, and they will only cover a tiny subset of C if anything.)

You can see here that the original C compilers use a hand-written expression parser using the Shunting Yard algorithm, and I'm sure the rest of the front end is hand-written too:

http://www.oilshell.org/blog/2017/04/22.html

C predates lex/yacc so the syntax wasn't really designed with them in mind.

Some other potential problems:

(1) Distinguishing types and variables in C:

https://eli.thegreenplace.net/2011/05/02/the-context-sensiti...

https://en.wikipedia.org/wiki/The_lexer_hack

(2) The preprocessor. While it doesn't prevent you from using lex/yacc for the C parser, this is an entirely separate lexer and parser that is always written by hand AFAICT.
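The "lexer hack" from point (1) can be sketched in a few lines of C. The interface here (`register_typedef` / `classify_name`) is hypothetical, not the API of any real compiler; the idea is simply that the lexer consults the parser's symbol table:

```c
#include <string.h>

/* The lexer hack: because "T * x;" is a declaration if T names a type
   and an expression otherwise, the lexer asks the symbol table which
   token kind to return for an identifier. */
enum { TOKEN_IDENT, TOKEN_TYPE_NAME };

#define MAX_TYPEDEFS 64
static const char *typedef_names[MAX_TYPEDEFS];
static int n_typedefs;

/* Called by the parser whenever it reduces a typedef declaration. */
void register_typedef(const char *name) {
    if (n_typedefs < MAX_TYPEDEFS)
        typedef_names[n_typedefs++] = name;
}

/* Called by the lexer for every identifier it scans. */
int classify_name(const char *name) {
    for (int i = 0; i < n_typedefs; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOKEN_TYPE_NAME;
    return TOKEN_IDENT;
}
```

This feedback loop from parser to lexer is exactly what breaks the clean lex/yacc separation, since a yacc grammar alone can't express it.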

Awk, on the other hand, is almost always parsed with lex/yacc. The language was almost designed around those tools.

I think if you're writing say your first compiler you should try to use lex and yacc. I think you can solve the context-sensitivity problem but I haven't done it myself. Operator precedence probably requires some extensions or maybe you can encode the many levels of C precedence in the grammar.

One thing I found from implementing a shell parser [1] is that the POSIX spec uses yacc grammars, even if the languages aren't best parsed with bottom-up parsing. For example, all shells except bash use top-down parsing.

Analogously, the C grammar may "manually" encode the operator precedence, even though the original C parsers and current ones like GCC/Clang actually use an operator precedence algorithm like the Shunting Yard algorithm, not yacc's shift-reduce algorithm.

e.g. this grammar manually encodes it:

https://www.lysator.liu.se/c/ANSI-C-grammar-y.html

Not that yacc can't parse expressions -- it's just something interesting I found.

Also, for shell, there are a bunch of parsing rules that are necessarily listed separately from the grammar, and I'm sure the same is true for C.

[1] http://www.oilshell.org/blog/tags.html?tag=parsing-shell#par...
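For what it's worth, the operator-precedence style of parsing mentioned above (Shunting Yard, or its recursive cousin, precedence climbing) fits in a few lines of C. This toy evaluator over single-digit operands is an illustrative sketch only:

```c
/* Precedence climbing: one loop plus a precedence table replaces the
   one-grammar-rule-per-precedence-level encoding seen in yacc grammars. */
static const char *src;

static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:  return 0;   /* not a binary operator */
    }
}

static int parse_expr(int min_prec);

static int parse_atom(void) {
    if (*src == '(') {               /* parenthesized subexpression */
        src++;
        int v = parse_expr(1);
        src++;                       /* skip ')' */
        return v;
    }
    return *src++ - '0';             /* single-digit number */
}

static int apply(char op, int l, int r) {
    switch (op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    default:  return l / r;
    }
}

static int parse_expr(int min_prec) {
    int lhs = parse_atom();
    while (prec(*src) >= min_prec) {
        char op = *src++;
        /* Left associativity: recurse with a strictly higher minimum
           precedence, so "8-2-3" groups as "(8-2)-3". */
        int rhs = parse_expr(prec(op) + 1);
        lhs = apply(op, lhs, rhs);
    }
    return lhs;
}

int eval(const char *s) { src = s; return parse_expr(1); }
```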


Thank you very much for all that information - much appreciated!

The Portable C Compiler was written using Yacc. [1]

>I think if you're writing say your first compiler you should try to use lex and yacc.

I've actually started with something higher level still: BNFC [2], which has taken me quite a long way quite quickly. It generates a lexer, parser, operator precedences and AST along the lines of the Appel books for a variety of backends: C, Java, OCaml, Haskell, etc.

But now I'm running into bugs and limitations in BNFC. I may fix some of those if I can, but at some point I will probably dump BNFC and then hand-edit the generated Flex/Bison.

Maybe that's what people do eventually - start with a high-level tool and then move lower?

[1] https://en.wikipedia.org/wiki/Portable_C_Compiler

[2] https://github.com/BNFC/bnfc


OK BNFC is interesting! I think that kind of tool is the holy grail, but they all have limitations. You kind of have to design your language AROUND it, and C obviously doesn't fit in that category.

For example, I mentioned that Awk was designed with yacc.

It depends how ambitious you want to be. You can tweak C to be more easily parsed, but then you will parse fewer real programs.

It's not surprising to me at all that BNFC will have problems with C. So I guess I would say you should use high level tools like BNFC / lex / yacc until you can't.

FWIW here is another little problem you will run into with C:

https://en.wikipedia.org/wiki/Dangling_else
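The dangling else in a nutshell, as a small C sketch (the function and its return strings are made up for illustration):

```c
#include <string.h>

/* The grammar alone doesn't say which `if` owns the `else`. C resolves
   the ambiguity by pairing the `else` with the NEAREST unmatched `if`;
   in yacc this shows up as a shift/reduce conflict that is resolved in
   favor of shifting. */
const char *classify(int a, int b) {
    if (a)
        if (b)
            return "a and b";
    else                       /* indented as if it matched `if (a)`... */
        return "a, not b";     /* ...but it actually binds to `if (b)` */
    return "not a";
}
```

So `classify(1, 0)` follows the inner else, and `classify(0, 1)` falls through to the final return despite the misleading indentation.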

I guess the point is that parser generators and even grammars themselves are leaky abstractions! To effectively use parsing tools you have to be familiar with their internals and the underlying algorithm. If there are two levels of code generation like BNFC, then that makes this task harder!

I think it makes sense to start high and go low. It also makes sense to start low-level and go high. Once you are sick of writing parsers by hand, you will understand how to write a better parser generator than the ones that currently exist :)


Guys, i just started reading through source code of 8cc and now this new project :). https://github.com/rui314/8cc


Nice. This is a really ambitious project for just 4 contributors (and it looks like one of them did most of the work). I'm curious why ASM and not LLVM?

Edit (comment above unaltered): I got downvoted by someone, which tells me I missed an obvious answer to my question. Did I miss something?

I've noticed a lot of compiler projects lately tend to default to LLVM, I'm curious of the strengths and weaknesses of LLVM vs ASM since the author here chose to compile to ASM.


My impression (as someone working on a compiler of my own, using LLVM), is that while you get a lot out of building on LLVM (portability across platforms, a collection of "free" optimizations, and a well defined low-level IR), it comes at the cost of a huge toolchain (takes about 5-10 minutes to compile from scratch on my laptop), and it is not stable across releases (the extent to which you are affected depends on the features your compiler depends on though). I think for toy projects it is probably more interesting to write the code gen yourself; and for projects focused on a specific platform, it is perhaps desirable to handle it yourself. Some people probably just don't want the dependency. My perspective is that I don't want to reinvent the work on cross platform support, or the optimization passes, and would rather take advantage of those efforts by depending on LLVM. I don't know that this makes sense for this project though.


LLVM is big, heavy, complicated, powerful. Kind of against the spirit of writing a lightweight C compiler. On the other hand, an x86 assembler can be easily implemented by one person and requires minimal cpu and memory resources.


Yeah, a fifth name appears in the macos branch. :) Not sure if github will show the name when macos is merged into master.

I wanted something small and easily self-hosting. A compiler on a floppy (as in old times). I hope you guys know what floppies are. :)

smlrc.c runs in 96KB on a PIC32 MIPS microcontroller.

LLVM is big and if I don't implement enough of the language (and I hear LLVM is written in C++, not C), I can't recompile LLVM.

I can write an assembler. Or I can use FASM, which is self-hosting, if I can't compile NASM or YASM.


> Currently it generates 16-bit and 32-bit 80386+ assembly code for NASM that can then be assembled and linked into DOS, Windows and Linux programs.

I wonder why not 64-bit? Hardly anyone uses 32-bit processors anymore, and even in the Windows world 64-bit systems are slowly taking over, so it would seem more logical to me to compile for modern processors instead of a 16-bit architecture.


The x86_64 instruction set is a lot more complicated than the x86 one.


That is so, but the subset of x86_64 that you'd use for code generation is pretty much the same as the subset of x86 that you'd use for code generation.


What do you have in mind here? What x86 code generated by a C compiler is not readily translatable to x86-64 code?


The additional registers would be useful. Even if sticking to the x32 ABI.


Going 64-bit is quite a bit of work:

- additional 64-bit types will need additional code for type checks and conversions, switch, etc

- new or improved code generator is needed (ideally, 64-bit non-pointer types should also be supported in 32-bit programs, possibly as register pairs)

- new object and executable formats to handle

- possibly new ABIs to support (pushing arguments onto the stack and never bothering to align the stack pointer onto a multiple of 8 or 16 boundary is kinda easy, it's pretty much free)

And the above work hasn't been done.


> Hardly anyone uses 32bit processors anymore,

On the contrary, nearly everyone has a processor that can run 32 bit programs now.


It seems there are C interpreters. I wonder how convenient it would be to use a C interpreter versus using Lua, for example.

Although I've never tried using an interpreter in a program, bridging calls from a script inside your compiled program seems difficult.


Hold on... I thought due to the C preprocessor, you couldn't have just a single pass. Unless they aren't counting it as a pass?


I don't see any reason the preprocessor couldn't be integrated with the main parsing pass. It's all top-down (things defined before they're used) just like C.


How does it compare to pcc and tcc?


From a bootstrapping or Diverse Double-Compiling point of view, the real question is "Can compiler Y compile compiler X?"

As of a few months ago, it is possible to compile (an old, non-C++ version of) gcc with tcc, which is a good position to be in:

https://lists.gnu.org/archive/html/tinycc-devel/2017-05/msg0...


Can you then compile a new version of gcc with that old gcc? Or are there more steps in between?


According to:

http://gcc.gnu.org/install/prerequisites.html

"versions of GCC prior to 4.8 also allow bootstrapping with a ISO C89 compiler and versions of GCC prior to 3.4 also allow bootstrapping with a traditional (K&R) C compiler."

So you might need to build tcc -> gcc 3.3 -> gcc 4.7 -> gcc 7.2


Can't we skip gcc 3.3?


You mean tcc -> gcc 4.7 -> gcc 7.2?

I don't think that would work because gcc 4.7 is written in C++.


It's less complete in terms of the language and needs less memory. smlrc.c runs in 96KB of RAM on a PIC32 MIPS microcontroller.


This was posted before, right?

I like that they allow FASM.

And that there is no "standard library".


From the Github page:

> The standard C library is work-in-progress and it's close to completion.


There is a standard library.


Yes, my mistake. I think this has been posted to HN before. I seem to remember trying it out.



