CakeML – A Verified Implementation of ML (cakeml.org)
163 points by setori88 on Jan 1, 2017 | 37 comments



I find it interesting that CakeML, like many other developments in this area, is based on SML (Standard ML) and not OCaml (Objective Caml). Moreover, whenever I read something about ML languages, it seems most people in the academic field talk about SML.

Yet it seems that OCaml is more popular among programmers and real-world projects, even though these programmers mostly come from the academic field themselves, given how niche OCaml still is. For example, the astonishing MirageOS project chose OCaml instead of SML.

So my question is:

How is that? Why is OCaml so much more popular, despite having just one implementation and no real spec? Why is SML, with its real spec and multiple implementations, not at least equally popular?

EDIT: Here are two possible answers that I don't think apply:

1. OCaml may be "good enough", which, combined with network effects, makes choosing OCaml over SML a self-fulfilling prophecy. I don't think it is that simple, because OCaml users and projects come mostly from the academic field and are deeply concerned with the correctness of code, which would mean they should all have favored SML over OCaml. In fact, sometimes correctness seems to be the sole motivation. For example, the author(s) of OCaml-TLS didn't just want to create yet another TLS library in a hip language; they were concerned with the state of OpenSSL and similar libraries, and wanted to create a 100% correct, bullet-proof alternative.

2. One could attribute this to the "O" in Objective Caml, but I don't think it is that simple either: the object-oriented extensions seem almost unused, and where I have seen them used (e.g. in LablGTK, an OCaml wrapper for the GTK UI library) I didn't see that much value; sticking to plain OCaml modules and functors would have led to a better interface.


I work with a bunch of people who were very involved in SML, and I've asked this question many times. They usually say that SML (as a language) stagnated because of personal conflicts in the standardisation committee. A bit of a "worse is better" microcosm, really, with a focus on discussing theoretical perfection instead of how to get things into the hands of users. You can partially see this in play today in the Successor-ML[0] discussions, where there's not really any clear vision or leadership driving things forward, just a lot of (often very good) language suggestions.

My own extrapolation is that a multitude of SML implementations meant that programmers couldn't take advantage of implementation-specific extensions, as that would prevent their code from running elsewhere. This hindered real-world experimentation (as opposed to academic investigations) and organic growth of the language.

But I should say that, as a starting point for creating new functional languages, SML is an exceedingly clean design. I'm using it myself for my own language design efforts.

[0]: https://github.com/SMLFamily/Successor-ML/issues


You can hear a bunch more about the full history of development of SML and interaction with OCaml here: https://www.youtube.com/watch?v=NVEgyJCTee4

I would definitely agree with the opinion above that the SML language designers fall in the camp of deliberating extensively and fully formalizing all new language features before considering them for inclusion.

There are two challenges right now in the Successor ML space:

- There's some broad agreement on a few things that should obviously be improved, but today many of the implementation owners don't have a huge amount of time, so getting them to also agree to the new features is complicated. Life is way easier when you only have one implementation.

- There's a bunch of fantastic ideas (largely driven by Bob Harper) to integrate stuff like parallelism, cost model, etc. that will be fantastic for a next version of the language and specification. It's just a lot of work, and not really publishable, so it's mainly tenured faculty and "alumni" (like myself and probably the people you work with) driving it forward in our free time.

That said, I still love working on SML and its implementations (I mainly contribute to SML/NJ and Manticore) and it's a wonderful escape from the grim reality of quarterly goals and "get it out the door" release deadlines :-)


> many of the implementation owners don't have a huge amount of time, so getting them to also agree to the new features is complicated.

That's another thing I noticed - people are not really prototyping the new ideas. I know that Bob Harper and John Reppy are also working on other projects, so they probably can't invest much implementation time in SML anymore. I wonder how OCaml and Haskell managed to keep implementers active.

The Successor-ML feature I'm most curious about is modular implicits (or some of the other alternatives mentioned by Harper). I much prefer the SML module system to Haskell's type classes, except that it is notationally more cumbersome to use. Sprinkling in a little implicit behaviour (where the choice of functor is unambiguous) would help a lot.
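To illustrate the notational overhead (a hypothetical sketch; the ORD/SetFn/IntOrd names are made up for illustration, not taken from any proposal): in today's SML every instantiation must name and apply the functor explicitly, which is exactly the boilerplate an implicits mechanism could infer whenever only one instance fits:

    (* A toy ordered-set functor in current SML. *)
    signature ORD = sig type t  val compare : t * t -> order end

    structure IntOrd : ORD =
    struct
      type t = int
      val compare = Int.compare
    end

    functor SetFn (O : ORD) =
    struct
      type set = O.t list   (* toy representation *)
      fun insert (x, s) =
          if List.exists (fn y => O.compare (x, y) = EQUAL) s
          then s else x :: s
    end

    (* Explicit today; an implicit mechanism could pick IntOrd *)
    (* automatically wherever an ORD for int is unambiguous.   *)
    structure IntSet = SetFn (IntOrd)

That final line is the noise: with type classes you'd never write it, and with implicits the compiler would fill it in for you.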


> I wonder how OCaml and Haskell managed to keep implementers active.

Members of that community could probably comment better than I, but it doesn't hurt that Microsoft Research supported two core Haskell team members and Inria has supported core OCaml work for most of the lifetimes of the projects. That corporate backing has also meant that as some fabulously enlightened BigCos took dependencies on them (most notably Standard Chartered for Haskell and Jane Street for OCaml), those companies have hired core developers and supported the communities. I should add that these have been fantastic for the PL community in general, and specifically for ICFP funding & attendance!

For as much as the members of any one community might have differences of opinion with the choices of the others, the success of any benefits us all. Even if we do grumble occasionally about the "can't get into ICFP this year unless you're in language {X}!" comments, where X was Scheme, then ML, then Haskell... :-)

> people are not really prototyping the new ideas

Yeah, John's group plus Matthew Fluet and a little bit of myself are trying to clean up Manticore enough to do a "final" release and journal paper, and that is kinda taking up a lot of our extra cycles. Several other folks have either retired or moved on to new areas (security, compiler verification, etc.).

I think that if we could find some places where the new ideas intersected with interesting novel (read: fundable) research topics, there'd be some additional energy into implementation work around them. Right now, certainly, stuff exists, at best, in private forks of (typically) MLton.


The old ML compilers were more deeply rooted in academia. It was almost fashionable to write a new one to fiddle with a new idea (like concurrency). Performance and more pragmatic considerations, including OS interfaces, were somewhat secondary.

Caml had these, and thus was used to build some software that became somewhat popular on Unix systems (e.g. Unison, MLDonkey, a Flash compiler). And while these days it might seem like we've got plenty of freely available, fast-compiling, native languages to choose from, just a few years ago the situation was very different. So some people chose the language for that reason alone, never mind syntax or semantics. As long as it wasn't C.

So you just got the feeling of more dedicated, pragmatic language maintainers and actual community use. People who really cared about purism went over to Haskell anyway.

(Personally, I like the idea of languages with more than one compiler, if only the "Standard" part of "SML" were a bit bigger.)


I wonder if the Objective part of it also made it look more lively in mainstream eyes, compared to a "stagnant/crude" SML.


As far as I can remember, the object-oriented features were almost deprecated even way back when. OCaml has never been much used for GUIs anyway, and I believe most servers and compilers were written in a mostly functional fashion.


The mystery stays intact.


I can add a few more data points. First regarding CakeML:

- CakeML is written in HOL4, which is itself written in ML.

- SML, unlike OCaml, has a well-defined semantics, and so provides a good starting point for a verified compiler.

Second, regarding OCaml: the easiest reason for choosing OCaml over SML is that there is only one implementation of the language, which everybody is using. This makes it a lot more likely that the OCaml compiler will still be around a few decades from now. There are several modern SML compilers, most of which are more or less unmaintained. Poly/ML, which everybody seems to be using these days, is substantially developed by a single person.


On top of that, SML was literally designed for proving: writing provers, feeding algorithms into them, and later extracting algorithms back out. OCaml was designed as a practical language for software development. Unsurprisingly, the provers and verified tools tend to be done in SML or a subset of it, while real-world software is done in OCaml.

That said, the CakeML team welcomes a translator for OCaml that outputs CakeML programs. Additionally, the OCaml compiler was clean enough in architecture that Esterel was able to do the source-to-object-code validation required for its DO-178B-certified code generator. They said they had to do far less work modifying and analyzing it than they expected. That means the OCaml compiler itself might be a candidate for verification, albeit probably a non-optimizing form of it. I'd also try the K framework, which successfully handled the C semantics with the KCC compiler. If it can handle C, I'd venture a guess that it can handle a better-designed compiler and language. ;)

EDIT to add: OCaml syntax is being worked on at the link below.

https://github.com/CakeML/cakeml/tree/master/unverified/ocam...


> SML was literally designed to do proving.

So was Caml (and its precursor Le-ML), originally :)

That it grew so far beyond that is, I think, a consequence of its not having a formal definition like SML's. This lets the developers extend the language willy-nilly without going through the grueling task of extending a formal semantics. On top of that, as another commenter pointed out, there's a single "reference implementation", which means that users don't need to worry about getting locked in when opting to use new functionality.

That said, SML is still a fantastic base for new research projects in the PL space, mostly for the very same reasons it hasn't seen nearly as much adoption in the "practical" space as Caml: it's a small, simple, well-understood, and formally-specified language that grants researchers an excellent starting point for their inquiries and experiments.


Didn't know that about Caml. Neat. Perhaps a subset of OCaml with formal semantics could do well, too.


It would be quite realistic and interesting to have a translation between the Scala subset that the Leon project supports (leon.epfl.ch) and ML. A good starting point for this is Stainless, a redesigned version of Leon: https://github.com/epfl-lara/stainless


I can't speak for others, but from a purely practical and non-academic point of view, I chose OCaml over SML because it's painless and easy to produce a standalone binary executable. Most SML implementations seem to be able to do that too, but require me to write compiler-specific glue code, IIRC. I don't want to have to dive into compiler internals to produce an executable. Also, I appreciate that the OCaml maintainers are not afraid to provide useful syntactic sugar although it may not seem “real FP”, like the `for` loop for instance. I seldom use the OO extension, but support for virtual methods can be very handy at times. You can implement the mechanics for those with functors too, but it's just more code to write that way.
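For example, this is the kind of glue I mean (a sketch from memory; heap-image suffixes and invocations vary by SML/NJ version and platform). SML/NJ's exportFn doesn't even yield a true standalone binary, just a heap image you launch through the runtime:

    (* hello.sml: SML/NJ-specific glue. exportFn dumps a heap image *)
    (* (e.g. hello.x86-linux), which you then run with:             *)
    (*   sml @SMLload=hello                                         *)
    fun main (progName : string, args : string list) =
        (print "hello, world\n"; OS.Process.success)

    val _ = SMLofNJ.exportFn ("hello", main)

MLton is the notable exception: `mlton hello.sml` gives you a real native executable with no glue at all, which is the closest SML gets to the ocamlopt experience.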


It may be similar to the reason for F#'s adoption -- it does objects.

Doesn't matter if you actually code with objects. The important point for adoption is that it does them.

So for a noob .NET programmer you say, hey, look at F#! You can code it just like C#. Kinda.

But as soon as they start coding, you tell them nope, all of these types of things are actually antipatterns.


I disagree with your premise on why F# was adopted. It isn't because it does objects, but because it was really the first functional language pushed by Microsoft that ran on the .NET platform. Had F# not run on the CLR, I don't think it would have nearly as many users as it does now.


That goes into why Microsoft Research chose to base F# on OCaml rather than SML (or Haskell, which Microsoft Research also has some fingers in that pie): it's not that useful targeting the CLR if you can't interact with other things running on the CLR (including the Base Class Library [BCL]). Presumably Microsoft could have tried for something wild like building a "CLR object monad" for Haskell, but the F# team instead went with the existing, known-commodity language family that already straddled the border between object systems and functional programming (OCaml).


I think people wanting to use C# will stick to it rather than learn another language just to use it the same way as the one they already know.

The real value of objects in F# lies in the CLR compatibility. Access to the BCL and some existing code is a big advantage over OCaml for .neteers.


We use OCaml for the virt tools because it (a) generates efficient, standalone binaries and (b) lets us call out to C libraries really easily. We use OCaml over (say) C or C++ because it allows us to write code with far fewer bugs and headaches. You can find one of these large, real world OCaml programs here: https://github.com/libguestfs/libguestfs/tree/master/v2v

I think network effects do matter much more than you say. There is virtually no community around SML.


Summary: CakeML is the first verified optimising compiler that bootstraps.

Side note: Cake stands for CAmbridge KEnt, which is where (most of) CakeML's verification was carried out.

The pioneering project in this space was X. Leroy's CompCert, the first verified optimising compiler: more precisely, a realistic, moderately optimising compiler for a large subset of the C language down to PowerPC and ARM assembly code.


Could you expand on the significance of the first, for those of us not familiar with formal verification?

Is this a first because it is theoretically difficult to do, or because it requires a lot of implementation time? What are some key points to read up on and understand in order to properly appreciate this result, beyond the Wikipedia article on formal verification [1]?

Thank you in advance for any elaboration.

[1] https://en.wikipedia.org/wiki/Formal_verification


The problem with verifying realistic compilers is scale. We have known how to do it in principle since forever, and verification of toy compilers is part of textbooks on verification, such as [1]; see also [2]. Realistic compilers are very complicated, and Leroy's verification of CompCert took several man-years for one of the world's leading compiler and verification guys. The purpose of research like CompCert and CakeML is twofold:

- Provide a verified software toolchain for programmers with a minimal trusted computing base.

- Investigate how the cost (in a general sense) of formal verification in general and compiler verification in particular can be lowered, ideally to the point that normal programmers can routinely use formal verification.

The advance that CakeML makes over CompCert is bootstrapping: CakeML can compile itself, while CompCert (being a C compiler written in OCaml) can't. Simplifying a bit, bootstrapping lowers the trusted computing base.
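To make "verified" concrete: the headline theorem of such a compiler has roughly the following shape (an informal sketch, not the exact mechanised statement, which also has to account for details like the compiled program running out of memory):

    \forall p\, t.\ \mathit{compile}(p) = \mathrm{Some}(t) \implies \mathit{behaviours}(t) \subseteq \mathit{behaviours}(p)

That is, any observable behaviour of the generated code t is a behaviour that the source semantics of p allows.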

Maybe Leroy's [3, 4] are good starting points for learning about this field.

[1] T. Nipkow, G. Klein, Concrete Semantics. http://www.concrete-semantics.org/

[2] A. Chlipala, A verified compiler for an impure functional language. http://adam.chlipala.net/papers/ImpurePOPL10

[3] X. Leroy, Verifying a compiler: Why? How? How far? http://www.cgo.org/cgo2011/Xavier_Leroy.pdf

[4] X. Leroy, Formal verification of a realistic compiler. http://gallium.inria.fr/~xleroy/publi/compcert-CACM.pdf


No, it goes further than that. They embedded assembly languages and many aspects of computation in HOL, then built their compiler. The thing goes straight from logic to assembly, with the theorem prover being the TCB outside of the specs themselves. Whereas CompCert was specified in Coq, would probably be extracted to an ML, and then that whole pile of code (hopefully verified down to assembly) would do the job. Unless they're doing all the compiles in Coq itself with its checker. This is the part I could be really wrong on.

The TCB reduction is huge. Also, the seL4 organization built the Simpl embedding of C in HOL to do "translation validation" (due to Jared Davis) of C straight to, or matched against, assembly. That skips the need for a CompCert-style verified compiler altogether. Myreen et al's techniques were also used to verify theorem provers and now hardware.

So, the CakeML effort and its effects are huge, maybe more so than CompCert's, given the flexibility and the fact that CompCert is a proprietary product now whereas Myreen et al's stuff is open. That's what I said back when I saw it, and the prediction was confirmed as COGENT was built on the same technology, with amazing results so far in cost of verification:

https://ts.data61.csiro.au/projects/TS/cogent.pml


Thank you.


The problem with optimizing compilers is that you're changing what the user expects to come out. This is very dangerous because you need to be able to prove that what you've generated behaves the same as the non-optimized version.

I know of a few programs that work fine when compiled with -O2 but break with -O3. This is because some of the optimizations applied are just not what the programmer expected (often the code relied on behaviour the optimizer is allowed to assume away), or, in the older days, the optimizations simply broke working code.

Formally verifying the output is difficult because you need to prove that the same computation is happening in both versions, even when they look extremely different in what they are doing. I'm assuming that's where the difficulty comes in.


Hence the Vellvm project and its ilk:

https://www.cis.upenn.edu/~stevez/vellvm


Everyone with a formal-methods background who is interested in this work should consider taking on one of their posted projects that would improve it, especially the OCaml-to-CakeML translator.

https://cakeml.org/projects

Just email them first in case someone has done the work already. Academics are sometimes slow to update web sites when digging deep into their research. ;) The best uses I can think of for CakeML are:

A reference implementation to do equivalence checks against, with the main implementation (an ML or not) being something optimized.

Something to build other tools in that need high assurance of correctness: prototype the tool to get the algorithm right, using any amount of brains and tooling that already exists, with an equivalent CakeML program coming out. Then that turns into vetted object code.

A nice language for writing low-level interpreters, assemblers, or compilers that bootstrap others in a high-confidence way. The idea fits verifiable or reproducible builds, where you want a starting point that can be verified by eye: people can look at the CakeML and assembly output with some extra assurance on top of hand-checking it. One might even use the incremental-compilation paper to build up a Scheme, ending up with a powerful starting language plus assurance that the binary matches the code.


Go into the compiler explorer [0] and type the following

    val num = 10
Then take a look at the x86 generation.

What is all of that? It doesn't look like data the executable needs. Is that just implicit functions or something baked into the language? If it is, why isn't it being tree-shaken?


It seems you need to add a semicolon after the line (the AST was empty). A minimal program is simply "();", and it compiles into something even more complex. A big part of it looks like runtime setup/cleanup code, common to all programs.
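Concretely, the input that actually parses is:

    val num = 10;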


It seems that there are some examples here: https://github.com/CakeML/cakeml/blob/master/explorer/exampl... Would be cool to have these examples accessible from the web interface of the compiler explorer.


Sadly, it doesn't seem like you can link to a given compiler output from their explorer.


What is ML in this context? Neither the CakeML site nor the Standard ML site appears to actually define it, and it's an acronym with a few meanings in tech (e.g. machine learning).


It's Meta Language. The ML language was developed in the early 1970s as the meta-language for the LCF theorem prover, i.e. as a language for programming proof tactics. The strong type system and type-soundness guarantees of ML were important to guarantee that such tactics could only prove correct theorems.

https://en.wikipedia.org/wiki/ML_(programming_language)
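The mechanism, in a minimal sketch of the LCF idea (illustrative SML, not actual HOL4 code): theorems are an abstract type whose only constructors are the inference rules, so well-typed tactic code has no way to forge a theorem:

    (* LCF-style kernel: `thm` is abstract, so the ONLY way to build *)
    (* one is via the inference rules this signature exports.        *)
    signature KERNEL =
    sig
      type form                      (* formulas; left abstract here *)
      type thm                       (* theorems; cannot be forged   *)
      val assume : form -> thm       (* rule: A |- A                 *)
      val mp : thm -> thm -> thm     (* rule: from A -> B and A, B   *)
    end

Tactics are then ordinary ML functions that produce `thm` values, and ML's type soundness guarantees they can only do so through the rules.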


For a bit of history about ML, see Appendix F: The Development of ML (on page 89) from here:

http://sml-family.org/sml97-defn.pdf



What is the difference between CakeML and Standard ML in terms of syntax and semantics?



