In my experience, many people will miss the significance of this, or will simply fail to grasp how this is implementing a programming language at all. After all, you're taking advantage of the host language for every single aspect of the new language, even tokenizing/parsing.
More interesting for beginners (in my opinion) are examples implementing a simple tokenizer, parser, and compiler to another source language. This seems to be what closes the gap between programming languages as mystical constructs and programming languages as programs themselves.
This point of view does tend to displease the traditional SICP crowd though :-).
I'm inclined to agree. Implementing a Lisp in a Lisp always involves a certain amount of hand-waving, without the feeling that I've really made something. Look, here's how to implement a JavaScript interpreter in JavaScript:
eval(someCode);
What I like is seeing all of the steps a typical real-world language implementation goes through, just in minimal form. To that end, a while back I wrote an interpreter for a BASIC-like language in a single Java source file. It tokenizes, parses, and interprets. It uses common real-world techniques for all of those:
It uses a hand-rolled state machine for tokenizing. Recursive descent for parsing. And the interpreter uses the visitor pattern to walk the AST.
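Those three techniques can be sketched in a few dozen lines. This is not the Java interpreter described above, just an illustrative Python miniature of the same pipeline (a hand-rolled tokenizer, recursive descent, and a visitor over the AST) for arithmetic like `1 + 2 * 3`; all names here are made up for the example:

```python
def tokenize(src):
    """Hand-rolled scanner: walk the string, switching on the current character."""
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isdigit():                     # "number" state: consume digits
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUM", int(src[i:j])))
            i = j
        elif c in "+*":
            tokens.append(("OP", c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return tokens

class Num:
    def __init__(self, value): self.value = value
    def accept(self, visitor): return visitor.visit_num(self)

class BinOp:
    def __init__(self, op, left, right): self.op, self.left, self.right = op, left, right
    def accept(self, visitor): return visitor.visit_binop(self)

def parse(tokens):
    """Recursive descent: one function per grammar rule."""
    pos = 0
    def peek(): return tokens[pos] if pos < len(tokens) else (None, None)
    def factor():                              # factor := NUM
        nonlocal pos
        _kind, value = tokens[pos]
        pos += 1
        return Num(value)
    def term():                                # term := factor ('*' factor)*
        nonlocal pos
        node = factor()
        while peek() == ("OP", "*"):
            pos += 1
            node = BinOp("*", node, factor())
        return node
    def expr():                                # expr := term ('+' term)*
        nonlocal pos
        node = term()
        while peek() == ("OP", "+"):
            pos += 1
            node = BinOp("+", node, term())
        return node
    return expr()

class Evaluator:
    """Visitor that walks the AST and interprets it."""
    def visit_num(self, node): return node.value
    def visit_binop(self, node):
        l, r = node.left.accept(self), node.right.accept(self)
        return l + r if node.op == "+" else l * r

print(parse(tokenize("1 + 2 * 3")).accept(Evaluator()))  # → 7
```

The visitor is what makes it easy to add a second AST walker later (say, a pretty-printer or a compiler pass) without touching the node classes.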
Mine also, strangely unlike many of these so-called "teaching" toy languages, has documentation. I don't understand the point of a tiny language for people to learn from if you made it tiny by removing all of the comments. :(
And, to try to avoid leading the reader astray, it calls out any shortcuts it makes. Those are hints where you'd want to do something more robust if you weren't trying to be minimal.
I agree. A counterargument is that you're inherently taking advantage of the processor(s) and the operating system, not to mention, ha ha, every single component and factor that you did not personally create, whether or not it is computational in nature...
Another route, my present hobby and something that I know many have done before: Take the Lisp program on page 10 of the Lisp 1.5 User Manual and translate it into C. Write a simple tokenizer/parser (read function) in C and translate that into Lisp. Once the C program can run the equivalent Lisp program, we're on our way.
At the outset I have no garbage collection. I just expect the memory space to fill up pretty soon. I could have `cons` compute a hash for each cell to eliminate common subexpressions, but for a while I suppose I won't bother at all. The C program has a loop to handle evcon (did I write cons? oops), eval, and apply as a trio of mutually tail-recursive operations. That's an obvious place to put a stop-the-world mark-and-sweep operation or some such thing.
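The hashing-`cons` idea above is usually called hash-consing. A rough Python sketch of it, under the assumption that every cell is created through `cons` (the names `Cell` and `cons` are illustrative, not from the Lisp 1.5 manual):

```python
# Hash-consing: cons consults a table keyed by (car, cdr), so two calls
# with equal arguments return the very same cell and common subexpressions
# are shared instead of duplicated.
_table = {}

class Cell:
    __slots__ = ("car", "cdr")
    def __init__(self, car, cdr):
        self.car, self.cdr = car, cdr

def cons(car, cdr):
    # Cells are keyed by identity: this only shares structure when every
    # cell was itself built by cons, which is the whole point of the scheme.
    key = (id(car) if isinstance(car, Cell) else car,
           id(cdr) if isinstance(cdr, Cell) else cdr)
    if key not in _table:
        _table[key] = Cell(car, cdr)
    return _table[key]

a = cons(1, cons(2, None))
b = cons(1, cons(2, None))
print(a is b)  # → True: the common subexpression is shared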
To get the C tools out of the loop eventually, hand-disassemble the binary program (C compiler output) just to see what's there. Then write a Lisp program (compiler) that translates the Lisp 1.5 User Manual program into something pretty similar.
A counter-counter argument is that implementing a language that is semantically distant from assembly language, over top of assembly language, counts as "from scratch", whereas skinning a language over with an interpreter (which doesn't even scan tokens, collect garbage or perform any I/O) doesn't count as "from scratch".
> More interesting [...] are examples implementing a simple tokenizer, parser, and compiler to another source language
> This point of view does tend to displease the traditional SICP crowd though :-).
I disagree. SICP does a fairly good job of explaining what's behind compilers and other languages. See chapter 5, "Computing with Register Machines", which immediately follows the chapter containing the eval/apply loop.
Although the tokenizer and parser are sadly missing in SICP, these are also the most boring topics. Yes, you can write them by hand, but in the end you'll learn how to use regexes and parser generators. Which is important, but only a tiny fraction of what happens inside a compiler or interpreter.
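For what the regex route looks like in practice, here is a hedged sketch (not from SICP) using one combined pattern with named groups, in the style Python's `re` module documentation suggests for scanners:

```python
import re

# One alternation per token class; the group name tells us which one matched.
TOKEN_RE = re.compile(r"""
    (?P<NUM>\d+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(src):
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        if m.lastgroup != "WS":               # drop whitespace tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x = 40 + 2"))
# → [('ID', 'x'), ('OP', '='), ('NUM', '40'), ('OP', '+'), ('NUM', '2')]
```

This is the part a lexer generator automates for you, which is why hand-writing it once is instructive but rarely repeated.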
> but in the end you'll learn how to use regexes and parser generators.
In my experience, most production compilers and interpreters use handwritten parsers, not regexes and parser generators. The latter are usually confined to small DSLs and toy compilers, because things like error reporting and recovery are usually a nightmare with parser generators.
I don't agree that the tokenizer and parser bits are "boring", but they are extremely well covered compared to subjects like code generation.
The (binary) lambda calculus interpreter at http://www.ioccc.org/2012/tromp/tromp.c, documented in http://www.ioccc.org/2012/tromp/hint.html, is as tiny (and incomprehensible) as it gets, but does all of input, recursive descent parsing, translation to bytecode, lazy evaluation (call-by-need, to be precise), garbage collection, and output.
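On the call-by-need point: the essence is that an argument's body is evaluated at most once and the result is memoized. A minimal sketch of that idea in Python (nothing like Tromp's actual bytecode machine; `Thunk` and `force` are illustrative names):

```python
class Thunk:
    """Call-by-need: delay a computation, force it at most once, cache the result."""
    def __init__(self, fn):
        self.fn, self.forced, self.value = fn, False, None
    def force(self):
        if not self.forced:
            self.value, self.forced = self.fn(), True
            self.fn = None                    # drop the closure so it can be collected
        return self.value

calls = []
t = Thunk(lambda: calls.append("eval") or 42)
print(t.force(), t.force(), len(calls))  # → 42 42 1  (body ran exactly once)
```

Call-by-name would re-run the body on every force; the memoization is what makes it call-by-need.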
I loved the "How I start: Nim" article that was posted to HN a while back: an interpreter, then a compile-to-Nim implementation of Brainfuck, all in a handful of lines of code.
C'mon, why stop there? Implement Scheme in your toy lambda. Then implement another toy lambda in your Scheme, and a Scheme in that. Continue until it takes several minutes to do (fib 5). Then write a JIT compiler for that Scheme. See what it takes to get back down to pre-Inception levels of performance.
Now, you've actually learned something: Abstractions are costly.
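Even one layer of interpretation makes the cost visible. A rough experiment (all names here are made up for illustration): run fib natively, then run the same function as a tiny expression tree walked by a toy evaluator, and compare:

```python
import time

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# fib encoded as an expression tree for a toy tree-walking interpreter.
FIB = ("if", ("<", "n", 2), "n",
       ("+", ("call", ("-", "n", 1)), ("call", ("-", "n", 2))))

def ev(expr, n):
    if expr == "n":
        return n
    if isinstance(expr, int):
        return expr
    op = expr[0]
    if op == "if":
        return ev(expr[2], n) if ev(expr[1], n) else ev(expr[3], n)
    if op == "<":
        return ev(expr[1], n) < ev(expr[2], n)
    if op == "+":
        return ev(expr[1], n) + ev(expr[2], n)
    if op == "-":
        return ev(expr[1], n) - ev(expr[2], n)
    if op == "call":
        return ev(FIB, ev(expr[1], n))

t0 = time.perf_counter(); native = fib(20); t1 = time.perf_counter()
interp = ev(FIB, 20);                        t2 = time.perf_counter()
print(native == interp)  # → True
print((t2 - t1) / (t1 - t0))  # interpreted is typically several times slower
```

Stack a second interpreter on top and the slowdowns multiply, which is exactly the tower-of-Schemes effect described above.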
Many modern Schemes (and similar languages like Racket and Clojure) have many optimizations and use bytecode and JIT compilation, so the abstraction overhead is not as big as in the naive versions. And you get nice code and powerful macros in exchange.