C as an intermediate language (2012)

tikhonj · on July 17, 2013

If you want to use C as intermediate language, you may be interested in the CIL project[1]. It stands for "C intermediate language"--might be relevant :P.

Fundamentally, CIL is a nice subset of C wrapped up into a nicely usable API. The idea is to shave off as many of C's inconsistent sharp edges as possible. You won't have to worry about quirks in the syntax or odd behavior because you have a nice high-level, curated API for generating C code.

It's designed to have a clean semantics, and semantics are very important. (Or so I maintain.)

[1]: http://www.cs.berkeley.edu/~necula/cil/

mkehrt · on July 17, 2013

I've used CIL (if only for school projects), and I can vouch for it being a joy to use, at least if you like OCaml. Also, their page on wacky C edge cases ("Who says C is simple?", link to frame contents: http://www.cs.berkeley.edu/~necula/cil/cil016.html") is great.

pjmlp · on July 17, 2013

For anyone wanting to read about C as intermediate language,

Compiler Design in C (1990)

http://www.amazon.com/Compiler-Design-C-Prentice-Hall-softwa...

EDIT: Adding some extra remarks I think might also be interesting to share.

Another approach, that I really like, is to output bytecodes that are mapped directly to macros in typical macro assemblers like NASM/MASM/TASM. Those macro assemblers provide very powerful macro systems.

Then map those macros to the corresponding assembly code.

Sure it gives a bit more work, but I find it more fun.

argv_empty · on July 17, 2013

From a look over the Amazon page, the book seems to be about writing a compiler in C, not about writing a compiler targeting C. Does it actually describe potential issues with compiling to C?

pjmlp · on July 17, 2013

It is an old book about implementing an C compiler in C.

The intermediate code is similar to the article, a mix of macros and basic C expressions as high level assembler.

I cannot remember all the details, the last time I used the book was around 1996.

You can still get the source code, http://www.holub.com/software/compiler.design.in.c.html

octo_t · on July 17, 2013

here's my one line C compiler (compiles to C):

  touch *.c *.h

CJefferson · on July 17, 2013

One advantage of going through C (which admittedly might not pay off) is that you get the advantage of compiler optimisations.

When compiling dynamic languages of course, that often doesn't work as the optimiser doesn't have anything to go on if you do everything through void* pointers, or a similar dynamic construct.

pjmlp · on July 17, 2013

Agree.

You might find some anti-C bias from my post history, but I am also very pragmatic and would certainly use C if required to do so.

I used the compilation to bytecode disguised as Assembler macros as part of a few toy compilers back in late 90's.

If it was today I would probably use LLVM or C(++) as being discussed in the article.

Roboprog · on July 17, 2013

C: AKA "portable assembler" :-)

pjmlp · on July 17, 2013

TI has a DSP processor where the assembly looks quite much like the early days of C:

    mov reg1, reg2    ==>  reg1 = reg2
    mov reg1, [reg2]  ==>  reg1 = *reg2
    add reg1, reg2    ==>  reg1 += reg2

And so on.

Roboprog · on July 18, 2013

looks sort of like PDP 11 assembler, particularly the [] brackets for indirect access through a register.

pcwalton · on July 17, 2013

Some of these aren't as simple as they seem at first glance:

> Dynamic binding: easy enough.

That technique won't be fast enough for dynamic languages. You really need polymorphic inline caches to make dynamic binding fast, which requires careful cooperation between the code generator, the front end, and the IR. A C compiler won't give you that level of control.

> Garbage collection

It doesn't work that well. The problem is that C compilers don't provide a way for the runtime to find all roots on the stack (tell apart integers from pointers). You pretty much either have to be conservative on the stack or spill all roots to the stack across function calls. Neither of them is very good: the former costs accuracy and prevents you from using a bump allocator in the nursery, and the latter costs performance.

There are other issues to consider as well, for example tail call optimization and undefined behavior.

yosefk · on July 17, 2013

I agree that compiling to C will not give you a great language implementation if you need this type of features; I think the implementation will be good enough for many cases though. To take an extreme example, CPython isn't a bleeding edge Python implementation perhaps, but it's still the most popular and practically relevant one; this is in fact true for many popular dynamic languages, one big exception in recent years being JavaScript.

Languages where a really great implementation might involve C code generation are probably indeed quite static, however. An example is Synopsys's VCS which AFAIK compiles Verilog to C++.

My own hands-on experience with this type of thing is with an in-house HDL and an in-house C dialect with (sizable, static) extensions for accelerator programming.

zrail · on July 17, 2013

I worked at a company that did a very large amount of data processing on a relatively small number of machines using a well-optimized c++ library. The library did everything, including the calculations and the data storage. This made it hard to write one-off queries that the executive team would request from time to time, since we would write a custom c++ program every time.

One day we came up with an idea: what if we could query the data store with SQL? The first iteration actually attempted to embed sqlite3 into the data access layer, which was functional but extremely slow because of all the type marshaling going on. A coworker and I came up with the second iteration which worked like this: a custom SQL-like language would be parsed by a Perl program using Parse::RecDescent, a recursive parser generator. The parse tree would then be translated into C++ that used the data access layers and processing layers directly. The compiled program was distributed to the cluster in the same way as the daily processes.

As far as I know this monstrosity still gets daily use, four years later.

octo_t · on July 17, 2013

There's a lot of extra stuff you get with targeting C: being able to write the run time in C very easily. For example its a lot easier to write your entire OO system in C in a few hundred lines, and keeping that easily debuggable is a massive time saver.

lmm · on July 17, 2013

Wouldn't that be even easier if you used LLVM IR, as you'd then be able to write your runtime in any language that LLVM supports?

limmeau · on July 17, 2013

Advantage of writing it in C: You can prototype an example client of your runtime in C before your code generator is working.

lmm · on July 17, 2013

Again, why couldn't you do that with LLVM IR?

luikore · on July 17, 2013

LLVM IR is designed for machine, not human, it's too verbose, and requires SSA form. A normal person can't easily reason the logic in such verbose language. Plus people are more familiar with C.

lmm · on July 17, 2013

Sure, but you can write your runtime test code in C - or any other language with an LLVM compiler.

__alexs · on July 17, 2013

In C you can write you run-time in any language that C can link to. Which is rather a lot of languages, and isn't tied to any particular compiler.

lmm · on July 17, 2013

Yes and no. You can, but if you wrote your runtime in e.g. Haskell or Python you'd lose most of the advantages the article describes - debugging and tracing would be much more complicated for anything that called into the runtime, and profiling would become very difficult.

As a linkage format C is both too high-level (it provides a lot of facilities that are irrelevant to this use case - so while C may be ubiquitous now, I suspect it's much easier to implement an interpreter from scratch for something like LLVM bytecode) and too low-level (it exposes the host machine's memory model, making it inherently unportable). Use the right tool for the job - programming languages for writing programs in, intermediate representations for code for representing intermediate code.

yosefk · on July 17, 2013

LLVM IR is less portable than C; and even where an LLVM back-end is available (say, x86 Windows machines), it's really nice to be able to compile everything with another compiler (say, VS) and get debug info and browse info in the generated files.

rayiner · on July 17, 2013

You lose a lot of things the C compiler does for you by targeting LLVM directly. A big one is debug info. The C compiler will emit debug information for you, and with some line directives, as shown in the article, or maybe scripts, you can get minimally usable source-level debugging without having to generate your own debug info.

pcwalton · on July 17, 2013

LLVM has high-level support for debug info. (You have to actually generate it, which isn't free, but neither is generating #line directives, so it's arguably a wash.)

Roboprog · on July 17, 2013

I worked at a company ("Morada Corp") in the early 90s that did just that for ... RPG II!

We took the code from IBM minicomputers and compiled it into C on many, many platforms (back when there were a few more unices, as well as OS/2 and VAX/VMS kicking around).

I only did a little maintenance on the compiler front end checker, though. I mostly worked on some supplemental tokenizers/runtimes for data file browser language and a DB/form DDL.

Anyway, GCC made a nice target to hit a large number of systems. (alas, Borland C on DOS at the time tended to choke on larger generated subroutines, being 16 bit w/out "huge" pointer support and all)

chalst · on July 17, 2013

An intermediate language (or representation) is the internal representation of code used for program transformation and reasoning about the code. This is not what the story is talking about - all the benefits are about the object representation (the compiler emits C code), and the presented compiler does not distinguish between the intermediate representation and the object representation; it does not need to, as it does no optimisations. There's nothing innovative about emitting C code from a compiler - it used to be more common than it is now.

There has been a little bit of research done on writing optimising compilers that use source-code-like intermediate representations: the Janus project from the 1990s springs to mind.

gjndrtjh · on July 17, 2013

Ć programming language (http://cito.sourceforge.net/) compiles C# subset to C.

The C code generator is very simple: http://sourceforge.net/p/cito/code/ci/master/tree/GenC.cs

bane · on July 17, 2013

Didn't Lex and Yacc pretty much use C as an intermediate language?

joshuaellinger · on July 17, 2013

memsql is doing the same thing with SQL.

harder than it looks.