Sometimes I wish for compiler tutorials that progress feature by feature, starting with simple features, rather than opening with complicated things like parsing expressions with multiple opcodes (the opcodes being implemented as function calls, which makes it even more complicated)...
For example, a sequence like this:
1. variable assignment
2. commands of one parameter
3. if / branch statements
4. much later: nested expressions, function calls, multiple stacks, environments
basically, starting with an abstraction of 3-parameter opcodes, and then slowly building features up the abstraction layers, as well as slowly building down the optimization layers, step by step (a rough sketch of what I mean is below).
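Something like this, say, in Python; the opcode names and the (op, a, b, c) tuple shape are just made up for illustration, and only the first two steps are covered:

# Toy three-operand instruction set: each instruction is (op, a, b, c).
# Step 1: variable assignment; step 2: commands of one parameter.
def run(program):
    env = {}                      # variable name -> value
    for op, a, b, c in program:
        if op == "set":           # set a <- constant b
            env[a] = b
        elif op == "copy":        # copy a <- value of variable b
            env[a] = env[b]
        elif op == "print":       # a one-parameter command
            print(env[a])
        else:
            raise ValueError("unknown opcode: %r" % (op,))

run([
    ("set",   "x", 41, None),
    ("copy",  "y", "x", None),
    ("print", "y", None, None),
])

Branching (step 3) would just mean replacing the for loop with an explicit program counter.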
It is hard to beat Prolog for whipping up interpreters. Like Lisp, you get to use symbols in terms, and on top of that you get Prolog's infix operators for building terms:
% A tiny arithmetic evaluator: input like 1+2*3 is already a Prolog term.
eval(X, X) :- number(X).
eval(X+Y, Z) :- eval(X, X1), eval(Y, Y1), Z is X1 + Y1.
eval(X*Y, Z) :- eval(X, X1), eval(Y, Y1), Z is X1 * Y1.

main :-
    write("Enter an arithmetic expression followed by a dot \".\" and a newline\n"),
    prompt(_Old, >),
    repl.

repl :-
    read(Input),
    eval(Input, Answer),
    print(Answer), nl,
    repl.
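If I remember right, a session looks something like this (assuming SWI-Prolog; consult the file and call main):

?- main.
Enter an arithmetic expression followed by a dot "." and a newline
> 1+2*3.
7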
Evaluators, and parsers too. Prolog is an amazing idea (even with its limitations). But when you come from the bit-array-banging side of computer programming, it's such an alien idea.
I've also been creating a programming language recently [1]. I'd recommend looking into the PyPy toolchain. It makes it pretty easy to hack together a decent interpreter (one that JITs). That, plus rply for the front end, gives you enough tooling to be dangerous. Here's a good tutorial [2].
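For a taste of rply, something like this (written from memory, so treat the exact names and signatures as approximate):

from rply import LexerGenerator, ParserGenerator

lg = LexerGenerator()
lg.add('NUMBER', r'\d+')
lg.add('PLUS', r'\+')
lg.ignore(r'\s+')

pg = ParserGenerator(['NUMBER', 'PLUS'])

@pg.production('expr : expr PLUS NUMBER')
def expr_plus(p):
    # p[0] is the value of the nested expr, p[2] is a NUMBER token
    return p[0] + int(p[2].getstr())

@pg.production('expr : NUMBER')
def expr_number(p):
    return int(p[0].getstr())

lexer = lg.build()
parser = pg.build()
print(parser.parse(lexer.lex('1 + 2 + 3')))   # 6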
Thank you! I've seen a lot of great articles and tutorials on HN about this subject, so you can probably find a lot of good resources doing a search here. To start I can link you to an article I wrote that teaches the basics of tokenizing and parsing: https://angularclass.com/making-a-mini-lisp-introduction-to-...
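The core trick, boiled down (this isn't the article's code, just the usual two-function version in Python):

# Pad parens with spaces, split into tokens, then recurse on the token
# list to build nested lists (a crude AST).
def tokenize(src):
    return src.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == '(':
        node = []
        while tokens[0] != ')':
            node.append(parse(tokens))
        tokens.pop(0)              # drop the closing ')'
        return node
    return tok

print(parse(tokenize('(+ 1 (* 2 3))')))   # ['+', '1', ['*', '2', '3']]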
Q: Why do those programs exist? Lex and yacc originally come from Bell Labs. Why wouldn't they use what you recommend instead?
And when I look at the credits for lex and yacc today I see some very capable programmers. Why do these programs exist? I may be biased somewhat as I use them regularly for small tasks, but my question is an honest one.
Also, someone else mentioned swi-prolog. I think gprolog is just as easy and is noticeably faster. Curious what HN readers think the advantages of swi-prolog are.
Because many don't know any better and are stuck in a '70s parser-tools mindset.
Tools like ANTLR[0], SableCC[1], JavaCC[2], MPS[5] and approaches like attribute grammars[3] and parsing expression grammars[4] are much better suited to modern needs.
I don't know gprolog, but SWI has GUI tooling and compilation, and I was already using it in 1997, so it is a very sound toolchain.
I've tried ANTLR and it was a horrible experience compared to Menhir: being stuck with LL(n) causes a lot of unnecessary refactoring, and the syntax for anything out of the ordinary is arcane and, IMO, not well documented.
LL(n) can be limiting indeed. But you can easily mix, say, PEG and Pratt. After all, where else do you need LR-like parsing besides binary expressions, the C-style dot syntax, and so on?
I think this is kind of a big question, isn't it? You're in effect asking why people would write LL parsers versus LR.
The short answer might be that LL is simpler, and that handwritten recursive descent can be faster (especially if lexing and later stages are directly integrated into the parser) than tooled LALR parsers, while LR is more expressive and can directly handle more grammars.
I think that, in the general case, the idea that recursive-descent parsers can only parse LL(k) languages is not quite true. For some theoretical conception of "parser", perhaps, but most real, hand-rolled recursive-descent parsers rarely recognize precisely an LL(k) language, because programmers tend to mix-and-match algorithms (mixing in a little shunting yard or precedence-climbing to handle infix operators, for example) and they tend to connect the parser with other modules in less restrictive ways (add in some two-way communication with the lexer and you can parse yourself a context-sensitive language, even!).
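For instance, precedence climbing for infix operators drops straight into a recursive-descent parser. A minimal Python sketch (tokens simplified to a plain list of strings, and evaluation done inline just to keep it short):

# Precedence climbing: loop on operators whose precedence is >= min_prec,
# recursing with a higher threshold for the right-hand side.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse_expr(tokens, pos, min_prec=1):
    lhs, pos = int(tokens[pos]), pos + 1          # "primary" is just a number here
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        lhs = {'+': lhs + rhs, '-': lhs - rhs,
               '*': lhs * rhs, '/': lhs // rhs}[op]
    return lhs, pos

print(parse_expr(['1', '+', '2', '*', '3'], 0)[0])   # 7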
That's not to say that I don't agree with the general argument that there's no sense in shunning parser-generators all the time. I've made good use of yacc (and ocamlyacc, and menhir, and Lua LPEG, etc.) in the past, and the parser for our commercial compiler at work is written with ANTLR to generate an LL(*) parser in C++. Such metaprogramming tools let us work at a higher level of abstraction, and we get all the usual benefits of working at a higher level.
I think that, in *most* cases, there's really no good reason not to use a parser generator (or, alternatively, parser combinators). I don't even suspect that you learn anything in particular that you wouldn't learn by learning how to formalize your grammar and specify semantic actions for a parser generator (unless perhaps you aren't familiar with mutual recursion, that is...). However, there are some good reasons not to use a generator[1], and, as with anything, the trade-offs need to be carefully considered before deciding.
This all reminds me -- I've been meaning to play with Treetop for Ruby lately :)
--
[1]: For example, it tends to be much easier to provide good error messages with a recursive-descent parser. It's not always obvious how a particular configuration for a table-driven parser corresponds to a particular mistake or malformed construct in the input. There has been some fruitful research in that area of late, though.
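For example (a hand-wavy Python sketch; the helper name and token shape are invented), each call site in a recursive-descent parser knows exactly what it was trying to parse, so the message can say so:

# Each token is a (kind, text) pair; 'context' describes where we are.
def expect(tokens, pos, kind, context):
    if pos >= len(tokens) or tokens[pos][0] != kind:
        found = tokens[pos][1] if pos < len(tokens) else 'end of input'
        raise SyntaxError("expected %s %s, but found %r" % (kind, context, found))
    return tokens[pos][1], pos + 1

# e.g. inside parse_if():
#   _, pos = expect(tokens, pos, 'LPAREN', "after 'if'")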
There's also something to be said for not having a dependency on an extra tool, especially since such tools tend to be complicated and difficult to debug or extend. As such, when you encounter a language construct that's not easily definable for the generator, you may have an easier time resorting to some hacks to get what you want rather than extending the tool.
> it tends to be much easier to provide good error messages with a recursive-descent parser
It is not any more complicated with a parser generator (if it's a PEG-based one). In fact, it's much easier; you're avoiding a lot of boilerplate this way.
> There's also something to be said for not having a dependency on an extra tool
Do not have a dependency on an extra tool. Use a meta-language, in which a parser generator can be embedded as a first class language feature.
> It is not any more complicated with a parser generator (if it's a PEG-based one). In fact, it's much easier; you're avoiding a lot of boilerplate this way.
True. PEGs are effing great!
> Do not have a dependency on an extra tool. Use a meta-language, in which a parser generator can be embedded as a first class language feature.
This is ideal, but unfortunately not always possible. For example, if you're stuck writing a language processor in C, you don't get the necessary tools for linguistic abstraction without an external dependency. In general, though, I agree with the sentiment.
Given how much of a novice I still am in the subject, you might want to take my suggestions with a grain of salt compared to others'.
I have always really enjoyed just reading the source code for programming languages. As I learn more and more, I seem to take away a bit more each time.
Personally, I've enjoyed reading through the source code for Go, since it is hand-written in Go. Being hand-written, it can be a little repetitive to read through, but I find it pretty easy to read and understand.
Also, I have read (at least parts of) the book "Engineering a Compiler"; being a novice in the subject, some of it goes over my head, but I think it does a better job of outlining the topic than any other book I have read.
I highly recommend Peter Norvig's example Lisp interpreters. He has Python and Java implementations and they are both very accessible, so you can digest them in a few hours' time. "Production" language interpreters tend to have so much complexity that it can be hard even for an expert to find the essence of the interpreter.
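The essence really is tiny; from memory, a compressed sketch of the same idea (my own code, not Norvig's) looks like:

import math, operator as op

# A tree-walking evaluator over nested Python lists standing in for Lisp forms.
def evaluate(x, env):
    if isinstance(x, str):               # variable reference
        return env[x]
    if not isinstance(x, list):          # constant literal
        return x
    if x[0] == 'if':                     # (if test conseq alt)
        _, test, conseq, alt = x
        return evaluate(conseq if evaluate(test, env) else alt, env)
    proc = evaluate(x[0], env)           # (proc arg ...)
    args = [evaluate(a, env) for a in x[1:]]
    return proc(*args)

env = {'+': op.add, '*': op.mul, 'pi': math.pi}
print(evaluate(['+', 1, ['*', 2, 3]], env))   # 7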
Very cool. Knowing little about programming language implementation, I wonder what it would take to make a compiler. For so simple a language, I'd think it wouldn't be much harder, no?
I haven't looked at the OP's code, but if you build this stuff the way a compiler book tells you to, the front end of an interpreter and a compiler is the same. You build an AST; if you're interpreting, you simply interpret that. If you have a compiler, the AST is turned into the target language, in most cases assembly. The front end is the scanner/lexer, the parser, and most likely a semantic analyser. So to answer your question :), if you make a non-optimizing compiler, turning the AST into assembly is trivial. The hard part of making compilers is the optimization. You actually don't need to write most of the front end yourself; there are compiler-compilers for that.
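To make the "trivial" claim concrete, here's a toy illustration in Python (nothing to do with the OP's code; the target is an imaginary stack machine, so the instruction names are made up):

# Emit code for an imaginary stack machine from a tiny expression AST.
def emit(node):
    if isinstance(node, int):
        return ['push %d' % node]
    op, lhs, rhs = node
    return emit(lhs) + emit(rhs) + [{'+': 'add', '*': 'mul'}[op]]

# (1 + 2) * 3  ->  push 1 / push 2 / add / push 3 / mul
print('\n'.join(emit(('*', ('+', 1, 2), 3))))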
Depends on what you compile to. Writing a register allocator may be a little tricky (if you compile to a VM/Assembly/..), though if you compile to something like C, it's very easy. LLVM is another alternative, which is also relatively simple to learn.
Compilers are in general far simpler than interpreters, especially if you're trying to build an interpreter with reasonable performance.
For a simple language, a compiler is nothing more than a chain: parse -> strip off syntax sugar -> resolve lexical scope -> propagate types -> flatten the code -> emit target code, with each step being independent and trivial, nothing more than a set of tree rewrite rules.
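For instance, the "strip off syntax sugar" step can literally be one rewrite rule per construct. A sketch in Python, with a made-up tuple-shaped AST just to illustrate:

# Desugar a 'while' node into more primitive 'loop' + 'if' + 'break' nodes,
# applying the rewrite bottom-up over the tree.
def desugar(node):
    if not isinstance(node, tuple):
        return node
    node = tuple(desugar(child) for child in node)
    if node[0] == 'while':                       # (while cond body)
        _, cond, body = node
        return ('loop', ('if', cond, body, ('break',)))
    return node

print(desugar(('while', ('<', 'i', 10), ('incr', 'i'))))
# ('loop', ('if', ('<', 'i', 10), ('incr', 'i'), ('break',)))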
ASTs are fighting back these days! I work with a system at Oracle for going straight from AST to JIT, with no bytecode, and it's proving to be very fast and very easy.
That's a nice set of readable code. I made a minor change just to add in floats instead of ints, which wasn't too hard to accomplish thanks to your structure.
(Though my changes are ugly).
If/when you get to implementing functions, things will get really useful :)
I just realized reading your comment that I did implement functions but I forgot to document them! I will update the readme immediately.
If you look in the main file, huo.c, you will see something called store_defs. That function takes the ASTs of defined functions and stores them in a key-val type store. If someone invokes a user-defined function, I just grab the AST, replace the variables with the values they passed in, and execute it. That code is in process_defs and is invoked by execute.
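For anyone curious, the general idea looks roughly like this (a Python sketch of the technique, not the actual C in huo.c):

# Store each function definition's body AST under its name; to call one,
# substitute argument values for parameter names and evaluate the result.
defs = {}   # function name -> (parameter names, body AST)

def evaluate(node):
    if isinstance(node, list):                    # (op arg1 arg2)
        op, a, b = node[0], evaluate(node[1]), evaluate(node[2])
        return a + b if op == '+' else a * b
    return node                                   # already a value

def substitute(node, bindings):
    if isinstance(node, list):
        return [substitute(n, bindings) for n in node]
    return bindings.get(node, node)               # replace params, keep the rest

def call(name, args):
    params, body = defs[name]
    return evaluate(substitute(body, dict(zip(params, args))))

defs['double_plus'] = (['x', 'y'], ['+', ['*', 2, 'x'], 'y'])
print(call('double_plus', [5, 1]))                # 11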
You may want to take a look at Bison/Flex. While they don't suit everyone's tastes, it's good to know how to work with them. For example, the Bash parser is written using those tools.
That's a good idea. I don't think it would be hard; you just get the interpreter to run in a loop and read from keyboard input. I'll put that on the list for upcoming features :-)
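The usual shape of it, as a Python-flavoured sketch; parse and evaluate here just stand in for whatever the interpreter already exposes:

# A minimal read-eval-print loop around an existing parse/evaluate pair.
def repl(parse, evaluate):
    while True:
        try:
            line = input('> ')
        except EOFError:                 # Ctrl-D exits the loop
            break
        if line.strip():
            print(evaluate(parse(line)))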