Resources on creating programming languages (tomassetti.me)
284 points by zephyrfalcon on April 21, 2017 | 54 comments



I was cynically expecting only a bunch of resources about parsing and compilation, but was pleasantly surprised that the first section was about language design!

Here's a shameless plug for Brown's PL course last year. The videos and assignments are all online. The assignments involve exploring mystery languages, and aim to teach a bit about the consequences of language design. Next year we might do a MOOC.

http://cs.brown.edu/courses/cs173/2016/index.html


I liked the fact that it started with language design too. One of the links points to a Paul Graham essay which touches on the idea that "You Need an Application to Drive the Design of a Language"; that writing a language for the sake of writing a language, without an application in mind, isn't as effective [0]. Thinking about the reason for creating the language in the first place is a fundamental starting point imo.

[0] http://www.paulgraham.com/langdes.html (Number 3)


So I checked out that link and:

This may not be an absolute rule, but it seems like the best languages all evolved together with some application they were being used to write. C was written by people who needed it for systems programming. Lisp was developed partly to do symbolic differentiation, and McCarthy was so eager to get started that he was writing differentiation programs even in the first paper on Lisp, in 1960.

...but can that claim really be supported? Are c and lisp "the best" languages and if that "rule" is true, why does the claim have to be tempered for the two examples (C was written by people who needed to do programming at the time - was PHP not written for needs, too? lisp was only kind of developed for symbolic differentiation, and was lisp 1.6 one of the best languages or do we prefer scheme, common lisp...arc)?

He could have at least tried to make us believe it! :)


The truth seems to be that we don't know what makes a good language. In academic research we have criteria for what makes a language expressive, and tools for analyzing existing languages and checking whether they allow for modular development or are fundamentally incapable of building sound abstractions.

What we don't have are answers to the "soft" questions. What makes a language pleasant to use (ignoring familiarity)? Which abstractions are easy to communicate and why or why not? What's the best way to communicate intent to a computer? It seems very unlikely that the answer should be "a text editor", but so far all alternatives have turned out to be terrible...

What's especially vexing about this is that it looks like a solvable problem. People in the social sciences have been measuring far more ephemeral phenomena; the only problem is one of funding and finding the right people to run such big and extensive experiments.


> The truth seems to be that we don't know what makes a good language.

On the other hand, we do know what makes a bad language. We can spot flaws in a language, and fixing them automatically makes it better — at least in the aspects one is interested in. C for instance has a number of flaws and questionable tradeoffs that are widely known by now:

Contextual grammar. While not too bad (the context is easy to maintain in the parser), unknown identifiers can lead to syntax errors, and slightly worse error messages.
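
To make that concrete (a tiny sketch of my own, with made-up names): whether a line is a declaration or an expression depends on what the parser already believes about the identifiers in it.

    T * x;      /* declares x as "pointer to T" if T names a type,        */
                /* otherwise multiplies T by x and discards the result    */
    foo(bar);   /* a call to foo(), or a declaration of bar of type foo?  */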

Complicated syntax for types. Matching declaration and use may have sounded neat at the time, but doing so kills the separation between an entity's name and its type. The result is not very readable. An ML-like syntax would be much better.
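
A well-known illustration is the declaration of the standard library's signal() function, where the name being declared sits in the middle of its own type. The typedef'd rewrite below (signal_alt is just my made-up name to avoid clashing with the real one) is a sketch of the more readable alternative:

    /* signal takes an int and a handler, and returns the previous handler */
    void (*signal(int sig, void (*handler)(int)))(int);

    /* roughly "signal : int * (int -> unit) -> (int -> unit)" in ML terms;
       a typedef recovers some of that readability in C itself */
    typedef void sig_handler(int);
    sig_handler *signal_alt(int sig, sig_handler *handler);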

Switch that falls through by default. The switch statement was clearly designed from an implementer's point of view — it maps nicely to a jump table. In practice, however, over 95% of switch statements do not fall through. Switch should break by default. (Having multiple cases branch to the same code is more common, but that is not incompatible with breaking by default; see how ML-style pattern matching does it.)
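
A hedged sketch of the usual failure mode (the enum values and helper functions are made up). Sharing a branch still works fine under break-by-default semantics; the bug is the break you forget:

    enum kind { KIND_INT, KIND_LONG, KIND_FLOAT };

    void handle_integer(long v);
    void handle_float(double v);

    void dispatch(enum kind k, double value) {
        switch (k) {
        case KIND_INT:
        case KIND_LONG:            /* two cases sharing one branch: fine  */
            handle_integer((long)value);
            break;                 /* forget this line and control falls  */
        case KIND_FLOAT:           /* silently into the float case        */
            handle_float(value);
            break;
        }
    }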

No direct support for sum types (tagged unions). We have structs, unions, and with them we can emulate algebraic data types. It's a pain in the butt however. Automating this somewhat would be nice, since sum types are so widely useful. Even if they were used just for error handling that would be a nice bonus.
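
Roughly what that emulation looks like in practice (the names are mine, just a sketch). Nothing ties the tag to the union member you actually read, which is exactly the part a language with real sum types checks for you:

    enum shape_kind { CIRCLE, RECT };

    struct shape {
        enum shape_kind kind;                 /* the tag     */
        union {                               /* the payload */
            struct { double radius; } circle;
            struct { double w, h; }   rect;
        } u;
    };

    double area(const struct shape *s) {
        switch (s->kind) {
        case CIRCLE: return 3.14159 * s->u.circle.radius * s->u.circle.radius;
        case RECT:   return s->u.rect.w * s->u.rect.h;
        }
        return 0.0;   /* the compiler never warns if kind and u disagree */
    }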

No generics. Makes it much harder to abstract away common code. Want to write a generic hash table? Good luck with pre-processor magic.
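
What "generic" code tends to look like with the pre-processor (again a hedged sketch, not from any particular codebase): either a macro with no type checking and double evaluation, or a template stamped out once per element type:

    #include <stddef.h>

    /* no type check, and (a) and (b) are each evaluated twice */
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    /* one struct definition pasted per element type */
    #define DECLARE_VEC(T)       \
        struct vec_##T {         \
            T *data;             \
            size_t len, cap;     \
        };

    DECLARE_VEC(int)      /* struct vec_int    */
    DECLARE_VEC(double)   /* struct vec_double */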

Textual macros. We can do, and have done, better than that.

Too many undefined behaviours. In the name of portability, many things that would have worked fine on many architectures are now undefined because some architectures couldn't handle them. And now we have silly stuff such as undefined signed integer overflow. But we can't remove them because compiler writers justify this madness with optimizations! (For the record, I have seen Chandler Carruth's talk on the subject, and I disagree: when compilers remove security checks because of undefined behaviour, it is just as bad as nasal demons.)
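
The signed-overflow case in a nutshell (a minimal sketch of the well-known pattern, not tied to any particular compiler version): the "obvious" check is itself undefined behaviour, so an optimiser may fold it away, and the defined version has to be written against the integer limits instead:

    #include <limits.h>

    int fits_naive(int x) {
        return x + 100 > x;        /* UB on overflow: may be folded to 1 */
    }

    int fits_safe(int x) {
        return x <= INT_MAX - 100; /* no overflow, keeps the check alive */
    }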

Silly `for` loop. That damn thing is fully general, we don't need that. We have the `while` loop. A simpler, less general syntax would have allowed optimisations that currently exploit undefined signed integer overflow, and then some.

---

Of course, it's easy to criticise a language over 40 years after the fact. But that's kind of the point: we have learned a good deal since the 70's. A language written now could not justify most of the flaws above. This is why so many people (me included) dismissed Go out of hand: not providing generics in a new statically typed language is just silly.

Sure, designing a good language is still very hard. But we can avoid more mistakes now than we could some decades ago.


This is a good point.

At least I find it encouraging that some of the newer languages are taking mistakes from the past into account.

They're not going to be perfect and they surely have their own flaws, but that's something someone will fix in a couple of decades, once we know which of the new ideas are actually good and which are bad.


>...but can that claim really be supported? Are c and lisp "the best" languages and if that "rule" is true, why does the claim have to be tempered for the two examples (C was written by people who needed to do programming at the time - was PHP not written for needs, too? lisp was only kind of developed for symbolic differentiation, and was lisp 1.6 one of the best languages or do we prefer scheme, common lisp...arc)?

I find most of the questions irrelevant to his point.

>Are c and lisp "the best" languages

Yes, by many metrics. They might not be the most academically perfect languages, but they both have been hugely resilient, influential, and/or successful, which is a good metric for considering what's "best".

> C was written by people who needed to do programming at the time - was PHP not written for needs, too?

Yes, and PHP is also very good for what it does, with huge adoption and powering over 30% of the web.

>was lisp 1.6 one of the best languages or do we prefer scheme, common lisp...arc

Doesn't matter, those are still Lisps, or owe 90% of their existence to Lisp 0.1. Any language can have multiple updates, mistakes fixed, cool features added, etc. PG talks more about the core of the language being great (the language as a set of concepts) than about Lisp 0.1 or C 0.1 being great from the start. Whether those languages were later updated with even better constructs is kind of irrelevant.


> Yes, by many metrics. They might not be the most academically perfect languages, but they both have been hugely resilient, influential, and/or successful, which is a good metric for considering what's "best".

like JS? Is it a best language?

C and lisp are very different examples of success - one has been dominant in business for decades and one was very influential but isn't really a business success today. Those are hugely different metrics - are you talking about popularity or core language design?

If you expand the definition for success that wide, then yeah, you'll find that all "successful" languages fit your standard because you don't have a very useful standard.

> Yes, and PHP is also very good for what it does, with huge adoption and powering over 30% of the web.

Yes, but knowing pg's stance on blub, he probably wouldn't agree. But that's the issue with the claim - PG didn't really expand on it so you have no idea what he really means.

The fact of the matter is we simply don't have that many examples of successful languages. Maybe n < 29 which means that every example we pick of "this is how language design works" is choosing from a very small sample, perhaps not representative of anything and all successful due to various factors. Not a good way to make "rules".


>like JS? Is it a best language?

Not as much. But JS is the opposite of what PG describes in 2 ways:

1) It didn't grow out of people needing to develop some specific app, and it wasn't organically designed to serve such a use. It was created by top-down dictum, by someone who didn't write, and didn't have to write, web apps himself, and in fact at a time when web apps weren't even a thing: it was created for light behavioral scripting. And it gets even worse, as some higher-ups even influenced its syntax (again by dictum): "make it look like Java".

2) Similarly, JS didn't survive and thrive as some evolutionarily selected best way to serve the web. It survived as the only way to serve the web, period. So we can't say whether its success is anything other than its having been forced on everybody, plus the success of the web in general.

>Yes, but knowing pg's stance on blub, he probably wouldn't agree.

Not so sure. Actually, IIRC PG used Perl along with Lisp in Viaweb's later days [1]? Not that far from PHP as far as elegance is concerned, but he knew it was a good tool for the job. ("Viaweb at first had two parts: the editor, written in Lisp, which people used to build their sites, and the ordering system, written in C, which handled orders. The first version was mostly Lisp, because the ordering system was small. Later we added two more modules, an image generator written in C, and a back-office manager written mostly in Perl.")


Same here about finding resources for language design.

I wish there was some course that could teach me language design so that I can (better) predict how my decisions will affect the final result.

Plus how much I can reasonably "pack" into the language [1] in terms of notation and cognitive load when used.

Maybe some resource about choosing notation would be nice too (as opposed to parsing notation once chosen). When I try to pick many ideas from different places, I'm sometimes surprised by collisions in notation.

In general, I have no idea in what model I have to run my estimates.

What are some examples of consequences taught through the course you linked to?

[1] I'm talking about textual languages only for now.


Maybe you could use some information theory to calculate the entropy of your language and evaluate when it makes sense to introduce new concepts. I played with this idea but I have never used it in practice for designing languages.
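
A toy sketch of one possible reading (my assumption: "entropy of your language" means the Shannon entropy of how often each construct appears in a corpus, so the constructs used most often earn the shortest notation):

    #include <math.h>
    #include <stdio.h>

    /* Shannon entropy, in bits, of a frequency distribution over constructs. */
    static double entropy(const double *freq, int n) {
        double h = 0.0;
        for (int i = 0; i < n; i++)
            if (freq[i] > 0.0)
                h -= freq[i] * log2(freq[i]);
        return h;
    }

    int main(void) {
        /* made-up frequencies: e.g. calls, assignments, branches, lambdas */
        double freq[] = { 0.45, 0.30, 0.20, 0.05 };
        printf("%.3f bits per construct\n", entropy(freq, 4));
        return 0;
    }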


Yeah, I was also thinking along those lines when writing my comment. That the trade-offs are like those in a Huffman tree: one easily accessible/expressible feature means many more distant ones.

Has someone thought and written about this?

And there are other aspects that don't quite fit the information theory model (like using infix notation for operators or not, from Paul Graham's post).


Thanks. Will definitely check it out. What is the assumed background for the students?


> What is the assumed background for the students?

Just intro CS, I believe.


Nice, I will consider adding it to the list


For anyone thinking about building a language with a JavaScript target, I would strongly recommend building on Babel. I've found the tooling to be terrific.

In my case, I wanted to build a (rough) superset of JS, adding the syntax and language features I was interested in but otherwise starting from a working (tested!) language. This made the task dramatically easier.

Forking babylon, Babel's parser, worked well for this in my case, but you could also write your own parser from scratch and use babel-types to build your AST.

In my case, I have a ~1,000-LOC babel plugin "compiler"[0] with maybe 1,000~2,000 LOC of mods to the parser[1], resulting in a language that looks fairly different from JS but feels pretty solid. Users of the language benefit from Babel's tooling – it works with webpack, ESLint (decently), etc.

[0] https://github.com/lightscript/babel-plugin-lightscript/

[1] https://github.com/lightscript/babylon-lightscript


Thanks for the reference to your project, I might use it as a guide for writing a language. I like the book "Build Your Own Lisp" because the simple implementation is easy enough to understand and hack. Using your example as a guide would be a second step for building something more practical. Very cool stuff.


Thanks! Would be happy to help – email is in my profile if you're interested.


This sounds awesome! Another good thing to add to the list


Appel's Modern Compiler Implementation in ML is really good, both as an introduction to compilers and for some detail on implementing the runtime, i.e. garbage collection. There are versions for C and Java, but ML is much easier to read in a book.

If you want a (slightly ugly) encyclopedia of compiler optimisations and IR designs, then Steven Muchnick's Advanced Compiler Design and Implementation is great.

Also, on the functional front SPJ's book (https://www.microsoft.com/en-us/research/publication/the-imp...) is really good.

While unfinished, Write You a Haskell is also a really interesting read.


I have the C book... It is quite antiquated and has a lot of ugly pointer math in the tokenizer/lexer beginning parts. It does sound like the ML book is a better choice.


"It is quite antiquated and has a lot of ugly pointer math in the tokenizer/lexer beginning parts."

It's possibly C code written by a guy that mainly does ML with a whole career kicking ass in ML languages and provers. That was my hypothesis when I saw both books available. I got a copy of the ML one instead. Appreciate confirmation it was a good choice. ;)


Yeah. AFAIK the C version only exists because ML is too niche for mainstream publishing. ML actually works as decent pseudocode (or at least it's easier to mentally transpile to other languages, unlike e.g. Haskell), so the C/Java versions are basically pointless.


That's what I was guessing. Especially after reading the article below and seeing lots of compiler/prover authors go with ML's.

https://news.ycombinator.com/item?id=14123100


What I've come to realize is that what we need is a language that lets you manage your level of technical debt, as in "langc mysource.lang --technical-debt=5". You generally want to start out writing PHP in the beginning of a project. However, at some point the technical debt will force you to enter maintenance mode. At that point it would be nice if you could turn down the compiler's tolerance from PHP-level to Rust-level, via Python and Java. Simultaneously you usually start caring about performance, so you need to move from interpretation to non-GC'd binary. Wouldn't it be nice if you could scale the strictness and compiledness of the language according to how messy and performance-critical your module or program has become? Granted, we'd need several backends for the same language and the compiler would become really quite messy, but it's interesting to consider.


A few points should be mentioned here, mostly because [non]strictness and dynamicity are not the same thing.

Tunable strictness is already there for languages that use type inference (and a nontrivial type system in general): Haskell, Terra, Julia, to name a few. In Terra you even have two separate worlds and the border is strict; the static side doesn't poison the dynamic side. Otoh, in Haskell, once you fix a type, all calls are specialized, so careful typing must be applied even for cases that really never happen.

Dynamicity is the other beast. Dynamic code is not code that was written in a dynamic language, but code that uses dynamic approaches to the problem. Metaprogramming, functional programming, and complex dispatch cannot be converted into efficient code without providing special hints to the compiler (iow writing C++ code). Even JITs usually give up on meta paths and provide little-to-no benefit there. So in general we can't convert really dynamic code to an efficient static representation. Otoh, if the code is not so dynamic, then type inference will do everything right for it, no type magic needed.

Edit: I forgot to add julia's details, quick intro is here: http://stackoverflow.com/questions/28078089/is-julia-dynamic...


Larry Wall tried, with some success. A Lisp / Scheme codebase could also be transformed along these lines.


Can you please give more context on what Larry Wall tried?


Larry Wall created (or, rather, kept on creating) Perl, an ultimately multi-paradigm language, allowing for wild differences in style, approach, and even syntax. (Yes, this does create problems, too.)


I can't agree more. I think a lot of the pieces to enable such a compiler stack are there--gradual typing, extensible parsers, and so on, but nobody has put them all together into a single place.


Common Lisp? Except that the typing could use some extension (probably could be provided through macros).


Gradual typing is a specific term; what you are looking for is optional typing.


I had a similar thought when using Clojure or Python: as the project matured, I wished I could switch to some statically typed language. I am very interested in language conversions, so I should give this a thought...


It's called Cython. I had a Python implementation of a sensor fusion algorithm as well as a C version of the same. Python was a staggering 135 times slower. I used Cython on it, typed a handful of variables and the result ran in the same time as the C version.

The speed up always depends on what you are doing, my example was chosen to yield the most extreme differences.


I think the idea of mixing an interpreter with a compiler (either AOT or JIT) is a bit stupid.

The only realistic way is to statically define an API boundary, then implement library-dependent static analysis inside the language. This is already possible with today's tools, except that most languages can't guarantee compile-time evaluation or provide enough static reflection.

I'm assuming that by "scale the strictness", you don't mean optional laziness?


This is one of the most overlooked topics in computing.

I've developed an in-memory database, naturally with its own programming language.

I disagree that you have to have an application in mind to design your language (sorry GP). But you definitely need use-cases in mind, as every decision has trade-offs.

Some are discoveries I've made along the way: 1. High-level - I remember having a for-each loop zoomed in on my TV; I may have looked at it for a couple of hours, thinking "how can I make it better", and it did pay off!

2. Parsers/Lexers - I couldn't find one that worked the way I needed (it had to allow me to remove lexemes that would normally be considered essential), so I started writing my own.

3. Abstract Syntax Trees (ASTs) - I don't think there's a guide on compilers that doesn't mention ASTs. I found this to be more of a hindrance than a help, so I removed it: yes, the parser generates the program code! All that is needed is a small stack of the starting points. This actually works better than anticipated. It's still a single-pass compiler, and you rarely need to know what comes next (before delegating). There are checks at every step, and the program can also generate source code (reverse engineer itself), so moving away from ASTs doesn't mean the program is harder to test. An unexpected result of going down this path is that the compiler 'understands' the code better than most, able to supply actual suggestions rather than cryptic error messages. In fact it will reject code that won't compile, or that breaks dependencies.

The lesson here is don't be afraid to experiment. Yes, it has taken some time, with improvements in each iteration (I've done some housekeeping recently, swapping out ~30% of the parser, and it hasn't been as daunting as you might imagine).


For those just getting started on building languages, here is a beginner's tutorial I created, using JS and Ohm. https://www.google.com/amp/s/www.pubnub.com/blog/2016-09-26-...


Is there a tutorial for building languages that uses an Earley parser? I always have a dream of using a controlled natural language for coding or as a view of existing code.


Surprised not to see reference to Walter Bright's excellent article from Dr. Dobb's, "So you want to write your own language?" [1]

[1] http://www.drdobbs.com/240165488


Right, but the list had to be limited in scope, and I focused more on longer pieces of content


I've been meaning to understand the idea of programming languages, removed from the syntactic details, and removed from the issue of converting a program to a machine executable form, and even optimizing a program.

I have a hunch that Krishnamurthi's book might be a good read in that regard (though not sure).

However I wonder what's left after you remove the issues I mentioned above. There are two cases: typed language and untyped language. In the first case you're left with Type Theory. In the second case you're left with, either a Turing Machine or a Lambda Calculus (e.g., implemented in a syntax free language like scheme). Is this the correct line of thinking? Or is there more to PLT than what I described?

To continue the thought further, in case of scheme, you have the base language on top of which you can build any PL feature or any PL paradigm you want. That means what's left is a survey/exploration of all possible PL features. Here I'm not interested in syntactic features. But in that case, what's left is really a survey/exploration of all possible PL paradigms, or programming paradigms [1]. So I guess what I'm really after is a massive study of the equivalent of all programming languages, but in a syntax-free way, i.e., in scheme syntax! (And I'm looking for a book that can provide that).

[1] https://en.wikipedia.org/wiki/Programming_paradigm


> However I wonder what's left after you remove the issues I mentioned above. There are two cases: typed language and untyped language. In the first case you're left with Type Theory. In the second case you're left with, either a Turing Machine or a Lambda Calculus (e.g., implemented in a syntax free language like scheme). Is this the correct line of thinking? Or is there more to PLT than what I described?

Well, there's still what you describe in the rest of your post. Even something minimalist like Scheme has a load of features above and beyond plain lambda calculus: dynamic parameters, continuations, dynamic-wind, mutable variables, boxes, exceptions, the whole macro system, and so on. Not every programming language feature that can exist has already been invented. We also don't fully understand the implications of every language feature that has already been invented -- there's plenty of work to do in designing type systems that can cope with more language features. You're not going to find one book that covers every feature people have come up with, but PLAI at least gives a fairly broad guided tour. I would add a book on semantics (probably Winskel, maybe Gunter) to develop a mental toolkit for carefully defining and reasoning about new language features.


I guess you have a point. Thanks for your comment (and recommendations).


Have you looked at Concepts, Techniques, and Models of Computer Programming? - https://en.m.wikipedia.org/wiki/Concepts,_Techniques,_and_Mo...


Thanks. I haven't but I'll give it a try.


Agree with the recommendation of "Concepts, Techniques, and Models of Computer Programming". You can also have a look at "Essentials of Programming Languages" (http://www.eopl3.com/).


Thanks. I should take a look at those two books.


I think you'd really enjoy Practical Foundations for Programming Languages (it's one of the first books suggested in the original article). It approaches language design like you say: start with the essence of the feature, don't worry about the syntax. It analyzes each feature thoroughly so you really understand exactly what the essence of that feature is.


Thanks. Looks like the kind of books I'm looking for.


Awesome--I've been knee deep in learning how C code translates into assembly instructions, learning more about the compiler's roles and responsibilities; but in the back of my mind, I always wonder: why is the C language (or any language, really) designed the way it is? A more concrete example: instead of "how do closures work in Python," I wonder "WHY are closures designed to work this way in Python?"


It mostly wasn't designed, if you're talking about C. It was almost totally the result of terrible hardware and personal preference inherited from BCPL, which was itself shaped by terrible hardware. The Vimeo link below goes from one paper & computer to another, from CPL to BCPL to B to proto-C's (IIRC) to C itself. It also shows the interplay between C and C++ development. I'm also including my rant that summarizes the video in bullet points with some commentary. I still need to update it, though, with critiques from the last time it was on HN.

https://pastebin.com/UAQaWuWG

https://vimeo.com/132192250

Now, for the other languages, who knows. I agree it's always fascinating to learn how they were done. I didn't get that feeling learning C's history. Not like when I studied Wirth's Lilith and Modula-2:

https://en.wikipedia.org/wiki/Lilith_(computer)

Note: Full description is in references under "Personal Computer Lilith."


Is it just me or do the scroll bars disappear on that site? In both Chrome and Firefox latest versions, I cannot scroll down.


Yeah they're gone for me too. Space, Page Down or Down Arrow do nothing.


I would like to read anything on cost-analysis, especially efficient cost-analysis on ASTs.



