CompCert - Compilers you can formally trust (inria.fr)
32 points by evangineer on June 4, 2011 | hide | past | favorite | 28 comments



Also worth reading: the paper that found bugs in the proven-correct compiler: http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf

(okay, so only a portion of the compiler is proven correct, and they didn't find any bugs in that portion).

Note also that a compiler with a proof isn't necessarily correct - it just means that you've pushed all your bugs into your description of the rules of C and of the target architecture. Since neither C nor any common CPU architecture has a formal semantics, someone has to read the (English) description and convert it to a description in their theorem-proving software, and there's no reason to believe this will be easy.


One thing I missed the first time reading the paper is that those were the only bugs they could find in the compiler. After reporting them and getting fixes, several more CPU-years of random programs haven't turned up another failure.

All the other compilers pick up tens of failing cases a night, which they skim off and report as their previously reported tests are fixed (the LLVM and GCC teams are quite responsive, others less so).

This kind of differential testing also provides some evidence that the formalization isn't too far off - each "passing" case means that almost all the compilers produced exactly the same results for the program.
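The differential-testing loop described above can be sketched in miniature (a toy model, nothing to do with Csmith's actual API; "programs" are just pairs of numbers and "compilers" are functions, with a deliberately injected miscompilation):

```python
import random

def random_program(seed):
    """Stand-in for a Csmith-style generator: a random, fully defined
    'program' is just a pair of operands to be added."""
    rng = random.Random(seed)
    return (rng.randint(0, 1000), rng.randint(0, 1000))

# Toy "compilers": each maps a program to the output of its compiled
# binary.  correct_cc models a correct compiler; buggy_cc miscompiles
# programs whose first operand is large.
def correct_cc(prog):
    a, b = prog
    return a + b

def buggy_cc(prog):
    a, b = prog
    return a + b if a < 900 else a - b  # injected miscompilation

def differential_test(compilers, seed):
    """A case 'passes' when every compiler agrees on the program's
    output; disagreement flags at least one compiler as wrong
    (assuming the program's behavior is fully defined)."""
    prog = random_program(seed)
    outputs = [cc(prog) for cc in compilers]
    return len(set(outputs)) == 1

# Skim off failing seeds overnight, report them, re-run after fixes.
failures = [s for s in range(1000)
            if not differential_test([correct_cc, buggy_cc], s)]
```

The key design point, as in the paper, is that no oracle for "correct output" is needed: agreement among independent implementations is the oracle, which is why a passing case is also weak evidence that the formalization matches the other compilers' reading of the standard.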

Both projects are awesome, especially if you like reliable compilers.


This old Knuth quote seems apt:

"Beware of bugs in the above code; I have only proved it correct, not tried it."


That seems like either a statement about the nature of informal proof or about the disconnect between the proof and the code. How does that apply to something written in Coq, where the program is extracted from a mechanically verified proof?


Mechanical proof of what, though? The spec can fail to capture intent.


In this case, the semantics of the C subset chosen are captured perfectly by the assembly produced by the compiler.

Seems like that's the intent of a compiler to me :)


Right, but first you have to specify "the semantics of C" and "the semantics of assembly". How do you prove you did that right?


Compcert uses small-step operational semantics for both the source language and the assembly output (and intermediate languages that are irrelevant to the final statement of the semantic preservation theorem).

Specifying such a semantics is basically like writing a very simple functional interpreter for the language. I don't see how it could get much simpler. Here is the description of the semantics for their input language:

http://compcert.inria.fr/doc/html/Csem.html

and their PPC output:

http://compcert.inria.fr/doc/html/Asm.html
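The "very simple functional interpreter" view can be illustrated with a toy small-step semantics for arithmetic expressions (a sketch only, not CompCert's actual Csem definitions):

```python
# A toy small-step operational semantics.  Terms are either Python
# ints (values) or ("add", t1, t2) nodes.

def is_value(t):
    return isinstance(t, int)

def step(t):
    """Perform one reduction step, mirroring the inference rules:
         (add n1 n2) -> n1 + n2        if both operands are values
         (add t1 t2) -> (add t1' t2)   if t1 -> t1'
         (add v1 t2) -> (add v1 t2')   if t1 is a value and t2 -> t2'
    """
    op, t1, t2 = t
    if is_value(t1) and is_value(t2):
        return t1 + t2
    if not is_value(t1):
        return (op, step(t1), t2)
    return (op, t1, step(t2))

def evaluate(t):
    """Iterate the step relation down to a value: the reflexive-
    transitive closure of ->, i.e. the 'simple functional
    interpreter' the parent comment describes."""
    while not is_value(t):
        t = step(t)
    return t

# ("add", ("add", 1, 2), 4) steps to ("add", 3, 4) and then to 7.
```

Each clause of `step` corresponds to one inference rule, which is why such a specification reads like an interpreter: there is very little room for the definition and the intended meaning to diverge.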


Yes. The approach is probably as good as it can get. And yet, it's not perfect. That functional interpreter could have a bug. And thus Knuth's quote is applicable. Testing the compiler could reveal to you that your specification wasn't really what you intended.


The semantics of the dialect of C that CompCert accepts are precisely defined, as are the semantics of the assembly output (that is, it's not strictly x86/PPC/ARM or whatever, but some subset with defined semantics which happens to line up perfectly with observed behaviour). The first is fine, I think (accepting a subset you define formally); the second is an issue, but it's currently impossible to do any better. Still, compared to the backends of LLVM and GCC this is certainly an improvement.


Then the spec is wrong, and there are no correct compilers.


There is already a C spec, and it isn't formalized in Coq. You can formalize it incorrectly.


I'm confused. The title made me think "list of compilers I can trust", but the website provides no such thing. It looks to me like they've written their own compiler.


Yes. CompCert is a project of the French research institute in computer science and automation (INRIA). The goal is to have a certified compiler for the C programming language. I believe they're writing a C compiler in Coq (a proof assistant also developed mainly at INRIA).

The main guy behind CompCert is Xavier Leroy (already known for OCaml).


There's a list of papers published on his page:

  http://cristal.inria.fr/~xleroy/

I keep this link in a feel-good folder, next to Bellard's page. Sometimes I need a feel-good break after reading US-centric news.


Yeah, the big thing here is that almost all of this thing is proven, meaning that the compiler's behavior is well-defined and correct, all the time, on all inputs.

I still had a jaw-drop moment the first time I heard about this; Coq is not a pleasant language, and proving this thing must have been a nightmare.


What does it do about the undefined behaviour parts of the spec?


You can see the statement here: http://compcert.inria.fr/doc/html/Compiler.html#transf_c_pro...

A C compiler doesn't have to promise much of anything about the behavior of the code produced for a program with undefined behavior.

This statement could be vacuous for a program p with undefined behavior if (Cstrategy.exec_program p beh) isn't true for any behavior beh, or only true for a behavior which makes (not_wrong beh) false.

If those conditions don't always hold, then they actually give you a guarantee for some bad programs as well. Nothing in the spec says running a bad program has to make demons fly out of your nose.
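The shape of that argument can be sketched in Python (a toy model of the theorem, with a hypothetical WRONG marker standing in for "goes wrong" behaviors; not CompCert's actual definitions):

```python
# Toy model of a semantic-preservation statement.  A program's
# source behavior is either a concrete result or WRONG, modeling a
# program whose execution goes wrong (undefined behavior).
WRONG = "wrong"

def preservation_holds(source_behavior, compiled_behavior):
    """The theorem only constrains programs whose source behavior is
    defined: 'source behavior is not wrong' is a hypothesis of the
    implication, so for a WRONG program the statement is vacuously
    true no matter what the compiled code does."""
    if source_behavior == WRONG:
        return True  # vacuous: no promise made about bad programs
    return compiled_behavior == source_behavior

# A defined program must be preserved exactly; a program with
# undefined behavior satisfies the theorem whatever the binary does.
```

So whether CompCert guarantees anything about a given bad program comes down to whether its semantics assigns that program some defined behavior or lets it "go wrong", exactly as the comment above says.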


The frontend is not verified. This part parses, checks and lowers C code to Clight. There is no formal specification of C itself, and that would be the first step in verifying the frontend.


It is a list of formally verified C compilers. At the moment, there's one.


I seem to remember a stack-based CPU that was built for correctness. I think it was called the Viper, but I cannot seem to find it. It would seem logical to me to run your formally trusted compiler on a chip designed the same way.


You might be thinking of Verisoft and their "VAMP" hardware stack: http://www.verisoft.de/SubProject2.html

The original goal of the project was to have a verified computer stack, with proofs going from the software, compiler and processor all the way down to the gate level.

I am not sure how far they got, but I don't think the project is still active, which is a bit sad.


To be realistic: the frontend is not verified, only a few optimizations are performed, and it only seems to have a PPC backend.

However, I still think verified compilers are the future. It will just take two or three decades until we see them at LLVM quality.


It now has x86 and ARM backends.


Would someone explain "formally trust" in this sense?


The trust you place is based on mathematical evidence for the soundness of the compiler. The compiler is thus trustworthy based on formal arguments, as opposed to informal arguments (such as "we did some testing" or "no bugs were reported").

Going back to Kant, formal arguments in this sense are a priori knowledge -- knowledge that is in some sense objective, and independent from any observer making empirical measurements (as opposed to a posteriori, or experimental, knowledge).


Thanks. My school's CS department was very strong in theory, among other things, and so there were a lot of very mathy people in the program mixed in with the more practical proles. Most of them were very nice, but there was something of a communication problem because (I assume) they were too naturally talented at such things to be able to explain them very well. I remember one guy who would get this look of satisfaction on his face every time a teacher would start talking about formalisms, as if he were eating his favorite food, or sitting down after a long day on his feet, and that's my first association with the word "formal" in a CS context.


A little bit of explanation about formalisms and the communication problems you mention from a computer scientist/linguist who gets that same look of satisfaction:

In natural languages, the relationships between utterances and what they convey are somewhat fuzzy. Words like "tree" and "run" don't have single precise meanings; they have a variety of senses associated with them, and these senses vary in how closely they are related to each other. Biological trees made of cells are quite closely related to computer science trees by their shared shape; the relation between running with your legs and running for office is a bit fuzzier, via the race metaphor; the relation between a river bank and a financial bank is even more unclear.

Meaning in natural language is built in aggregate from associations, with the senses of expressions converging over time in groups of communicating speakers. Natural languages do not clearly define the relationships between different expressions, nor between expressions and the real world. Because natural languages are unspecified and evolving, you can only measure how often (in your observation) different expressions and relationships between expressions reflect different situations.

Formalisms, like natural languages, are also used to represent meaning via expressions made of symbols and relationships between them. (Technically, formalisms don't have to represent anything, but in practice they usually do because otherwise they're not very useful.) Unlike natural languages, formal languages have normatively defined specifications. These specifications can be written in a variety of ways, but usually they define rules (which always apply) for deriving expressions from each other.

To be able to use formal languages for real-life problems, you need to have these five skills:

Symbol manipulation: deriving expressions in desirable forms from old ones using a specification

Interpretation: relating formal expressions to the domain

Encoding: capturing domain knowledge in the formal language

Abstraction: finding consistent patterns in the domain (to be expressed in the formalism)

Modeling: expressing relationships in the domain using relationships between symbols

Unfortunately, most people are not very good at these skills. It is hard to break habits like using intuition and relying on others' ability to resolve ambiguity. Furthermore, most people are not very good at dealing with abstractions, especially new ones or higher-level ones. To make things worse, some of the problems involved in using formal systems are genuinely hard, even for computers and mathematicians. Obviously, even the most accomplished mathematicians are not always able to get the results they are looking for, let alone quickly, but people who are good with formalisms usually don't have as many problems with things like checking their intuition or dealing with abstractions.

The problem with communicating formal ideas to people who are not good with formalisms is rather fundamental, because the very attributes which make formalisms useful are the ones that those people have trouble understanding. You can, of course, relate the results of using a formalism, but without using formalisms it is often very hard to usefully express the "why" of the result. The problem is compounded because people who are good with formalisms have a hard time figuring out what is understandable to others, due to the general "knowing it too well to teach it" effect as well as lack of practice (since most people who care about formalisms are good with them).

Some talented communicators are able to explain formal results satisfactorily to others by relating them to familiar abstractions or rephrasing their logic in everyday terms, but unfortunately this frequently results in misunderstanding and misapplication of the results. Furthermore, there are a number of formal results which really can't be properly understood unless you have mastered the skills above (particularly in theoretical computer science and formal logic). The problem is akin to trying to talk to non-programmers about programming, but more general. Programming languages are formal languages, of a sort, but most programmers only understand a few particular formalisms. Theoretical computer science and proof theory involve dealing with a larger and more abstract set of formalisms, including formalisms about formalisms.

If you need examples of the difficulties, just look at all the ways that people try to explain Gödel's incompleteness theorems or monads informally.



