GHC Parser Principles (haskell.org)
110 points by jasim on May 14, 2018 | 24 comments



I have a commit on GHC (a very modest one). The compiler is written in a mix of older and newer Haskell styles, which makes sense since it's one of the oldest and largest codebases in the language. There is a constant drive to modernize it, and some new techniques have come out of GHC architectural changes [0].

I didn't work with the parser, mostly the constraint solver in the type checker. Still, I browsed quite a bit of code. It's one of the most pleasant and well-documented codebases I've worked in, in any language. In particular, the [note] system is an excellent way to embed long-form documentation that touches many components without using fancy doc tools: grep suffices.
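
Roughly, the convention works like this (an illustrative sketch only; the Note title and function here are made up, not taken from GHC's source):

  {- Note [Example of the note convention]
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     The long-form explanation lives here, next to the code it concerns.
     Any other code that depends on this behaviour refers to it by title,
     so a plain grep for the title finds both this essay and every use
     site.
  -}

  -- See Note [Example of the note convention]
  tricky :: Int -> Int
  tricky n = n + 1   -- placeholder body; the Note explains the real subtlety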

If you want to see what a large, old, industrial-grade Haskell project looks like, I think GHC is a good place to look. And if you ever run into trouble with GHC, I encourage you not to be afraid to hack on it. The code won't bite, and the maintainers have no shortage of patience for newcomers.

[0] https://arxiv.org/abs/1610.04799


I appreciate the pun in that paper:

  Algebraic Data Types (ADTs) and pattern matching in functional languages
  lay a fertile ground to conveniently define and process data types as tree-like
  structures. However, in these grounds, trees often cannot grow; once a data type
  is defined and compiled, its definition cannot be extended.


Can you go into more detail about the [note] system?


Simon Marlow and Simon Peyton Jones wrote a chapter about GHC for the Architecture of Open Source Applications ebook, and they talk about the note system there. Skip to section 5.6.

http://www.aosabook.org/en/ghc.html


I have very mixed feelings about using parser generators instead of hand-written parsers. I've contributed to GHC's parser grammar and spent a lot of time reading Rust's hand-written parser.

On one hand, a grammar is an invaluable tool. It helps keep you honest about ambiguities when you start thinking about adding language features. I think languages should always keep some reference grammar up to date.

That said, I don't think it is a good idea for compilers to use parser generators. A hand-written parser can have better error recovery and more helpful error messages. GHC's parser has some pretty bad parse error messages.

Some work was merged into Happy a while ago to support better messages (enabling you to see the expected tokens along with the bad token), but using that in GHC itself is blocked by how complex GHC's tokenizer has become.


> A hand-written parser can have better error recovery and more helpful error messages.

That depends on what you're comparing to. The Menhir parser generator can generate a file listing all possible "error states" of the parser, in which you can specify a nice, friendly error message for each one.

I've posted this before: http://gallium.inria.fr/~fpottier/slides/fpottier-2015-11-ou... (ignore the French title, the rest is in English). The examples for a C grammar in Menhir show much better error messages than the hand-written GCC ones.


Grammars don't keep you entirely honest about ambiguities, because parser generators silently resolve some of them, like Yacc resolving shift/reduce conflicts in favor of shifting by default. It's possible to massage a Yacc grammar into working and then not entirely understand it.
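
For instance, the classic dangling-else conflict, sketched here in Yacc-style notation (a textbook example, not taken from any particular grammar): the rules are ambiguous, yet Yacc only warns about a shift/reduce conflict and shifts, silently binding each else to the nearest if.

  stmt : IF expr THEN stmt              /* shift/reduce conflict on ELSE */
       | IF expr THEN stmt ELSE stmt
       | other_stmt
       ;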

> A hand-written parser can have better error recovery and more helpful error messages.

A hand-written parser lets you put a breakpoint on the function which handles a given phrase rule, and when the execution stops there, you get a call stack which more or less corresponds to the top-to-bottom derivation, with arguments you can examine at every level.


The difference is that a bottom-up parser is at least able to detect the ambiguity and complain about it. And anyone who wants to keep their sanity won't rely on the default Yacc behavior and will take care to get rid of all the shift/reduce conflicts.


Since GHC was created, we now have some very good parser combinator libraries such as Parsec and uu-parsinglib. Haskell is a very good language for writing a parser by hand. But no doubt it would be a big project to move off Happy, and success isn't guaranteed.


I'm writing a Haskell compiler in C as a hobby. I hadn't looked at GHC's code, but I find the way they deal with optional indentation interesting. Nonetheless, I implemented the same thing in a simpler way: a function in my parser checks for the presence or absence of curly braces and sets some state on the parser.

My implementation: https://github.com/angel-manuel/schc/blob/c1fa77b332a765b476...
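
For illustration, here is a minimal Haskell sketch of that idea using Parsec (not the linked C code, and ignoring the real offside rule):

  import Text.Parsec

  -- Parser state records whether the current block used explicit braces.
  type P = Parsec String Bool

  -- Peek for '{', remember the answer in the parser state, and parse the
  -- block with or without braces accordingly.
  block :: P a -> P [a]
  block item = do
    explicit <- option False (True <$ lookAhead (char '{'))
    putState explicit
    if explicit
      then between (char '{') (char '}') (item `sepBy` char ';')
      else item `sepBy` newline   -- simplified stand-in for layout rules

You would run it with runParser (block item) False "" input, supplying False as the initial state.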


Out of curiosity, do you have a mathematics background? For me, the scariest part of this is type inference and type checking.


That's neat; the parser works hard to make the language look simple.

GHC's success is a double-edged sword; it has made Haskell effectively synonymous with "the language accepted by a recent GHC". The upside is an amazing language with a great many advanced features. The downside is an enormous language in a monoculture. It's nearly the antithesis of C (a small language that grew little and slowly, but which has hundreds of implementations).


Haskell 98 is a very simple language.

Modern GHC Haskell is to Haskell 98 as C++ is to C.


It's funny, you could also say this of Ruby and MRI, which is in so many other respects the polar opposite of Haskell.

Edit: Spelling.


I don't know either, but FWIW, there _are_ many (ok, several) Haskell implementations, but they only implement the language defined in the standard, and nobody has been able or willing to chase all the extensions in GHC.

Do Ruby and MRI have specs detailed enough that you could implement the entire language from just those? Few languages do, but Scheme and Haskell are among them.


Ruby is actually an ISO standard. Admittedly, it's frozen at Ruby 1.8 and it doesn't count for much.


Similar to what I'm noticing right now in VSCode, where the IDE warns you that comments are not allowed in a JSON file. This means the parser happily parses comments and performs a separate check afterwards. Beats an "unexpected symbol '/' at 12:4".

Great UX!
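
A minimal sketch of that idea in Haskell (purely illustrative, nothing to do with VSCode's actual implementation): the tokenizer accepts // comments even though JSON forbids them, and a later pass turns each one into a friendly diagnostic instead of a parse failure.

  -- Tokens carry their offset so the later check can point at the comment.
  data Token = TComment Int String | TOther Int Char
    deriving Show

  tokenize :: String -> [Token]
  tokenize = go 0
    where
      go _ [] = []
      go pos ('/':'/':rest) =
        let (txt, rest') = break (== '\n') rest
        in TComment pos txt : go (pos + 2 + length txt) rest'
      go pos (c:cs) = TOther pos c : go (pos + 1) cs

  commentWarnings :: [Token] -> [String]
  commentWarnings ts =
    [ "comments are not allowed in JSON (at offset " ++ show p ++ ")"
    | TComment p _ <- ts ]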


One thing I’ve learned in my years of working on compilers & programming language tech, which at first is counterintuitive, is that for the best user experience in a compiler, often you want your parser to be “total”, and not to reject any inputs.

Instead of bailing out at the first error, you parse invalid portions of the input and sync up to the next point where you can recover and keep parsing, annotating the resulting token stream or parse tree with error indicators such as “invalid number literal token” or “missing bracket”.

This lets you report all errors in a source file at once, so they can all be highlighted in an IDE for example, regardless of errors earlier in the file. The more semantic information you have about the correct parts of the program, the better error messages you can produce about the incorrect parts.

You can extend this to all areas of the compilation pipeline—for example, proceeding to resolve names and typecheck the definitions in a program that are syntactically correct, while ignoring the others and leaving them as error markers.
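
A minimal sketch of that recovery pattern in Haskell (a toy token-list grammar, not any real compiler's code): the tree has an explicit error constructor, and when a statement fails to parse, the parser skips ahead to the next ";" and keeps going.

  import Data.Char (isDigit)

  data Stmt
    = Assign String Int
    | ErrorStmt String [String]   -- diagnostic message plus the skipped tokens
    deriving Show

  parseProgram :: [String] -> [Stmt]
  parseProgram [] = []
  parseProgram toks =
    case toks of
      (name : "=" : val : ";" : rest)
        | not (null val), all isDigit val ->
            Assign name (read val) : parseProgram rest
      _ ->
        -- Recovery: record an error node, resynchronise at the next ";".
        let (bad, rest) = break (== ";") toks
        in ErrorStmt "expected `name = number ;`" bad
             : parseProgram (drop 1 rest)

So parseProgram (words "x = 1 ; y = oops ; z = 3 ;") yields both valid assignments plus one error node in between, which is what lets an IDE underline every problem at once.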


This philosophy is an integral part of the good error reports in TypeScript: https://www.infoq.com/news/2016/05/anders-hejlsberg-compiler


> Instead of bailing out at the first error, you parse invalid portions of the input and sync up to the next point where you can recover and keep parsing

People like saying that, but I've never seen it work well or provide any benefits. Because these tools are necessarily imperfect, you will get false positives, i.e., an earlier error will confuse the system sufficiently to generate later errors that aren't really errors in the program. So you have to fix errors in order anyway. So there is no benefit to producing further messages. You just risk producing several pages of spurious diagnostics for a simple typo.

Of course this is not completely true, but it's true enough. Or if it isn't, I'd like to hear about examples of systems that never produce such false positives, and where for some reason it makes sense (and is actually beneficial) to change something later in the file first.

Until then, just give me the first error. I won't look at the rest anyway, so no need for the compiler to be "clever" about it.


It works best in an IDE, or in my case in Emacs where I can jump to each error in turn. On the command line or in an interactive mode (REPL, notebook, &c.) I usually just want one error at a time. I’ve had fairly good luck with this in C# in Visual Studio or Xamarin Studio with Roslyn, although I don’t actually use .NET much—even though I worked on Mono for a few years, hah.

It also really depends on the grammar of the language how well you can do recovery. Ideally a language should be designed from the beginning to support it. Simple example: if a language allows nested function definitions with the same notation as top-level definitions, then you can’t reliably insert a missing closing bracket for a definition, because you can’t assume that the next one is a new top-level; it could be nested. (Although of course you can use heuristics like indentation to get around this.)

Anyway, the point of successful recovery is that the diagnostics later in the file are not spurious. And if the compiler supports recoverable parsing, you can still have a mode where it just bails out after the first error if that’s what you prefer.


> Anyway, the point of successful recovery is that the diagnostics later in the file are not spurious.

True if you only consider superficial syntax things. False in the real world, where compilers also check whether identifiers are declared, and with what type. If you have an error (syntax or semantic) and decide to press on, in general your symbol tables will be in a messed-up state, and you will generate spurious type or "undeclared identifier" errors. Also, in the real world, the compilers I have experience with do not only continue if they can be 100% sure that it makes sense, but they invariably continue further because they try to be "smart".

The IDE point is a semi-valid one, since an IDE doesn't scroll to the bottom-most, most useless errors like a terminal does. But I'm still not convinced that it adds value, i.e. that it enables situations where it's better to fix the second error before the first.


Yeah, it’s definitely a bad idea to press on beyond the point where you can reasonably recover, and actually better to err on the side of fewer diagnostics and less information per message if it means the messages aren’t misleading.

For example, if a compiler says “Expected semicolon before open bracket”…you should be able to insert a semicolon there! In my experience, that’s what a beginner programmer or someone unfamiliar with the language is going to try. If that’s not possible, the compiler is probably leaking implementation details of the parser, and should say something else. The most important thing is that the source location points to the actual code where something is wrong so the programmer can examine it. The message is secondary—make it actionable if possible, otherwise just avoid making it actively harmful.

But it also really depends on the language how well you can do recovery. Most languages in wide use are not designed to handle this, in both syntax & semantics.

Again, even if you don’t want to go for full error-recovery, it’s still a good thing to support in a compiler pipeline, because it makes it easier to produce good error messages if the compiler accepts a superset of the actual language—recognising common mistakes in order to offer fixes, for example.


This works better in an IDE than if you interact with the compiler directly. If you look at the compiler output directly, the false positives drown out the actual errors, so it is hard to find out what you need to fix. But in an IDE the false positives just become red squiggly lines further down the file, and don't obfuscate the important red squigglies at the top.



