Domain-specific compilers are a bigger field, where there's more opportunity to do something interesting
Take C (or some other language) and its mature optimizing compilers as a foundation, and write a program that compiles a high-level description of a domain-specific program into C
For example, NN-512 (https://NN-512.com) is a compiler that takes as input a simple text description of a convolutional neural net inference graph and produces as output a stand-alone C implementation of that graph
GNU Bison (https://www.gnu.org/software/bison/) is another example, where the output is a parser: Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser (either in C, C++, or Java) which reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar
Halide (https://halide-lang.org/) is a language for fast, portable computation on images and tensors, and it originally generated human-readable C++ code as output (but not anymore, now it feeds directly into LLVM)
Can anyone provide other examples? (Code generators for network packet processing?)
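To make that shape concrete, here is a toy sketch in TypeScript (the description format and the emitC function are invented for illustration; none of the tools above works exactly this way) of a generator that turns a tiny high-level description into standalone C:

    // Toy "DSL to C" generator: take a tiny declarative description of an
    // elementwise vector operation and emit a standalone C function.
    // The Op shape and emitC are hypothetical, for illustration only.

    type Op = { name: string; expr: string }; // expr is a C expression over x[i]

    function emitC(op: Op): string {
      return [
        "#include <stddef.h>",
        "",
        `void ${op.name}(const float *x, float *out, size_t n) {`,
        "    for (size_t i = 0; i < n; i++) {",
        `        out[i] = ${op.expr};`,
        "    }",
        "}",
      ].join("\n");
    }

    // High-level description in, C out; a real tool would parse a text
    // description and do domain-specific optimization before emitting C,
    // then hand the result to a mature C compiler.
    console.log(emitC({ name: "relu", expr: "x[i] > 0.0f ? x[i] : 0.0f" }));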
> Take C (or some other language) and its mature optimizing compilers as a foundation, and write a program that compiles a high-level description of a program into C.
Don't you still need many of the techniques taught in this class to build something like that? The material doesn't look inherently tied to low-level languages or machine code as the compilation target.
One way to think of it is to split the "compiler" problem into three big pieces. 1) Front-end, 2) back-end, 3) tooling.
Front end is lexing, parsing, building the AST (in whole or just keeping pieces of it around), semantic analysis. When that is done, you can say "Yup, valid program. It is possible to generate code." So, to your question, yes. Those techniques from the course are 100% in play.
Back-end is turning the AST into something you can optimize and generate code from. (Sometimes there is a "middle-end" in there....) High-level optimizations, low-level optimizations, memory allocation, register allocation.
Tooling is everything that keeps your life from being miserable as a user -- starting with debug symbols and these days people expect library and package management, etc, etc.
So if you are interested in exploring the Next Great Syntax, or semantic problems that can be rewritten into C, then a cfront-style compile-to-C front end is a great way to do that. Let the C compiler handle all the really-really-really-hard back-end issues for you, and avoid all that distraction. But... expect that getting debug symbols from your spiffy language into the ready-to-link object file is going to be, as they say, "non-trivial". So... great for experiments and a great learning exercise, but it's hard to do a production compiler that way.
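To make the split concrete, here is a skeleton in TypeScript (all names are invented; this is no particular compiler's API, just where the boundaries fall):

    // Hypothetical skeleton of the front-end / (middle-end) / back-end split.

    interface Ast { kind: string }   // whatever survives parsing and semantic analysis
    interface Ir  { ops: string[] }  // whatever the optimizer works on

    // Front end: lexing, parsing, semantic analysis.
    // After this: "Yup, valid program. It is possible to generate code."
    declare function frontEnd(source: string): Ast;

    // Middle end (when present): lower the AST to an IR, high-level optimizations.
    declare function middleEnd(ast: Ast): Ir;

    // Back end: low-level optimizations, register allocation, code emission.
    // Targeting C instead of machine code turns much of this stage into a printer.
    declare function backEnd(ir: Ir): string;

    // Tooling (debug symbols, packaging, etc.) wraps the whole pipeline.
    function compile(source: string): string {
      return backEnd(middleEnd(frontEnd(source)));
    }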
Hi! This course is about the "middle end," FWIW. We do not do parsing or codegen in 6120, and there is no goal (as there is in many undergrad-level compilers courses) of making a complete, end-to-end compiler for a C-like language.
I have just spent the last 13 months significantly reworking a compiler that turns Excel-like formulas into JavaScript. The compiler is used by an app designer, and the formula language is meant to be easy to pick up for people used to spreadsheets. A significant difference compared to Excel is that our compiler features static typing, enabling helpful error messages to be produced at build time.
(Spreadsheets, on the other hand, tend to try very hard to make sense of types at runtime, using type coercion to turn values of the wrong type into something of the expected type they can work with. That produces surprising results and confuses users. =IF(2, 3, 4) will generally produce 3, because 2 -- which should be a logical/boolean value -- is interpreted as TRUE. We think it's a better idea to warn the user that the first parameter is wrong, and to do that as soon as possible.)
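Here is a toy sketch in TypeScript of what that kind of build-time check might look like (the type names, argument shape and message are invented; this is not Calcapp's actual implementation):

    // Toy build-time check for IF(condition, then, else): report a non-logical
    // condition instead of coercing it at runtime. All names are invented.

    type FormulaType = "NUMBER" | "LOGICAL" | "TEXT";
    type Arg = { type: FormulaType; source: string };

    function checkIf(args: Arg[]): string[] {
      const errors: string[] = [];
      if (args.length !== 3) {
        errors.push("IF expects 3 parameters.");
      } else if (args[0].type !== "LOGICAL") {
        errors.push(
          `IF: the first parameter must be LOGICAL, but "${args[0].source}" is ${args[0].type}.`
        );
      }
      return errors;
    }

    // =IF(2, 3, 4): a spreadsheet coerces 2 to TRUE and returns 3;
    // a static checker can flag the formula at build time instead.
    console.log(checkIf([
      { type: "NUMBER", source: "2" },
      { type: "NUMBER", source: "3" },
      { type: "NUMBER", source: "4" },
    ]));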
By far the largest challenge in reworking the compiler has been the type system. I have expanded our type system from the four standard spreadsheet types to more than 80, and have added support for lambdas (for use with functions like SUMIF and FILTER), type inference, parameterized types and union types. The biggest addition is probably an imperative mode, which executes formulas with side effects in response to events being fired. This is all in service of features we have wanted to add for years but that have been held back by our (previously) simplistic type system.
I did take a compiler class at university, more than 15 years ago, but while that proved very useful when writing the formula parser (an operator-precedence parser, initially straight out of a textbook), type theory was not part of the class. I had to pick up bits and pieces by reading the Java Language Specification (JLS) and by experimenting.
The Cornell class doesn't seem to feature type theory either, which I think is unfortunate. Now that the pendulum has swung, yet again, firmly in the direction of static typing, perhaps it will become a bigger part of CS compiler curricula.
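(As an aside, for anyone who hasn't seen one: this is roughly what a textbook precedence-climbing / operator-precedence expression parser looks like. A generic TypeScript sketch that evaluates while it parses, not Calcapp's parser:)

    // Minimal precedence-climbing expression evaluator for + - * / and parentheses.
    // Generic textbook sketch only.

    const PREC: Record<string, number> = { "+": 1, "-": 1, "*": 2, "/": 2 };

    function tokenize(src: string): string[] {
      return src.match(/\d+(?:\.\d+)?|[+\-*\/()]/g) ?? [];
    }

    function evaluate(tokens: string[]): number {
      let pos = 0;

      function parsePrimary(): number {
        const t = tokens[pos++];
        if (t === "(") {
          const v = parseExpr(0);
          pos++; // consume ")"
          return v;
        }
        return Number(t);
      }

      // Parse and fold operators whose precedence is at least minPrec.
      function parseExpr(minPrec: number): number {
        let left = parsePrimary();
        while (pos < tokens.length && (PREC[tokens[pos]] ?? -1) >= minPrec) {
          const op = tokens[pos++];
          const right = parseExpr(PREC[op] + 1); // +1 keeps the operators left-associative
          if (op === "+") left += right;
          else if (op === "-") left -= right;
          else if (op === "*") left *= right;
          else left /= right;
        }
        return left;
      }

      return parseExpr(0);
    }

    console.log(evaluate(tokenize("1 + 2 * (3 - 1)"))); // 5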
Your project sounds like fun. What is the project, and who is it for? What level of Excel formula compatibility are you shooting for, e.g. just in the neighborhood like Google Sheets, reasonably close as in LibreOffice, or full compatibility with the latest version? If the latter, you must have one gnarly test suite. Would love to hear more about it.
The compiler is for Calcapp, an app designer for Excel-savvy users. Our aim is to offer a function library that is easy to pick up for our target audience. That means that we generally follow Microsoft's lead in terms of naming and parameters, but we often offer extensions. Here are a few examples:
* Functions like SUMIF operate on a range or an array and use a textual mini-language to determine which values to operate on. =SUMIF(B1:B2, ">3"), where B1 contains 2 and B2 contains 4, returns 4, because only B2 satisfies the ">3" condition. We're not big fans of that mini-language (which expects you to use the string concatenation operator & to handle dynamic values). We support it, through an overloaded function, but also offer a version which takes a lambda as the second parameter. In Calcapp, SUMIF({ 2, 4 }, Item > 3) and SUMIF({ 2, 4 }, E -> E > 3) (with a named lambda parameter) are equivalent to SUMIF({ 2, 4 }, ">3"). The lambda syntax enables much more complex conditions to be expressed.
* Some functions like SORTBY take a sort order parameter, which is an integer where 1 means sort in ascending order and -1 means sort in descending order. We support functions with integer sort order parameters, but also offer an overloaded version taking an enumerated value instead (SortOrder.ASCENDING or SortOrder.DESCENDING), which we think helps with readability.
* Excel's logical operators (like =, <> and <=) return arrays when applied to arrays and ranges. For instance, { 1, 2 } = { 1, 3 } returns { TRUE, FALSE }. (This is easier to see in recent versions of Excel, where array results "spill," meaning that a return array with two elements spills to occupy two cells of the grid. In earlier versions, as well as Google Sheets, you'll have to use INDEX to tease out the results.) We support = and <>, but also == and !=, where the latter two operators always return scalar (non-array) values. == enables users to determine if two arrays are identical, without having to use AND on the result array (which is the standard Excel trick). Also, our == operator is more traditional in that string comparisons are case-sensitive (unlike =).
* Experienced Excel users like to use double negation to turn logical arrays into number arrays. We have a specialized operator, --, for that, because allowing two unary operators to follow one another would introduce other issues. Here's an example: https://exceljet.net/excel-functions/excel-sumproduct-functi...
* The Excel operators * and + can be used with logical arrays and ranges to signify logical AND and logical OR, respectively. (https://exceljet.net/formula/sum-if-cells-contain-either-x-o...) We have overloaded versions of those operators that take logical arrays for that reason. (Again, Calcapp uses static typing, which makes this a lot more complicated for us than it is for Excel.)
* The latest versions of Excel have great support for dynamically-allocated arrays (along with new array functions like FILTER and SORT). Their new calculation engine supports users providing arrays where, traditionally, non-array parameter values are expected. If arrays are provided for these parameters, then an array is returned. For instance, =SQRT({ 4, 16 }) returns { 2, 4 }. Microsoft internally calls this "lifting" parameters. We refer to these parameters as being "array-capable." In our type system, an array-capable type is a union type (T|Array<T>), with the notable difference that if an array is provided to an array-capable parameter, then the return type is no longer T, but Array<T>.
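A rough TypeScript rendering of that array-capable idea (my own sketch of the typing rule, not Calcapp's type system):

    // "Array-capable" (lifted) parameter: the function accepts T or Array<T>,
    // and the return type follows the argument. My own sketch, not Calcapp's.

    type ArrayCapable<T> = T | T[]; // the union type described above

    // Overloads capture "scalar in, scalar out; array in, array out":
    function sqrtLifted(x: number): number;
    function sqrtLifted(x: number[]): number[];
    function sqrtLifted(x: ArrayCapable<number>): ArrayCapable<number> {
      return Array.isArray(x) ? x.map(Math.sqrt) : Math.sqrt(x);
    }

    console.log(sqrtLifted(16));      // 4
    console.log(sqrtLifted([4, 16])); // [2, 4], like =SQRT({ 4, 16 })

The SUMIF variant that takes a lambda instead of a criteria string (mentioned above) is the same flavor of overloading: two signatures for one function name, resolved at the call site from the argument types.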
We do have a test suite for the runtime formula function implementation with a couple of thousand unit tests. Given the number of supported formula functions, though, we should probably have many more.
Most of what I have described here pertains to a new Calcapp version, which hasn't been released yet. In particular, while I am done implementing the compiler, it still needs to be documented (lots of Javadoc), user-facing documentation needs to be written and the runtime system needs to be implemented, so we still have a few (fun!) months to go before everything is ready.
AnyDSL (https://anydsl.github.io/) might be worth mentioning: it is a framework for creating domain-specific languages that uses partial evaluation to give the user powerful abstractions while still producing high-performance code for different target platforms.
On the website, you can see that they already have DSLs for ray tracing and image stencil codes, as well as work in bioinformatics.
Not open source, but I'm working on a system that turns a DSL (embedded into Jupyter notebooks) into either C or C with OpenCL for a niche finance application. Not a huge project (<10k lines of code). The nature of the problem means that the DSL can be very simple (it's aimed at non-programmers). This approach also means that we could easily add other targets, e.g. CUDA or WASM.