Colm programming language released: best parser-writer ever

ScottBurson · on May 18, 2011

Thurston claims that no previous grammar system supports his three requirements of generalized parsing, grammar-dependent scanning, and context-dependent parsing.

I would argue that Prolog Definite Clause Grammars, which date back to the early 1970s, have all three of these properties. Furthermore, since the context is maintained functionally, by threading additional values through the productions, no "undo actions" are required; Prolog's built-in backtracking is all that's needed.

Of course, the problem with DCGs is performance: they're exponential in the worst case. But I think they deserve mention in a dissertation like this anyway. Also, any backtracking parser risks exponential worst-case performance; it will be interesting to see how Colm avoids this fate (I've only read the first few pages yet).

alan-crowe · on May 18, 2011

I remember, back in 1985, trying to write a parser in Prolog on a VAX 11/750. I didn't know what I was doing and my code fell into the exponential case trap. Even minimalist examples appeared to crash as they took unreasonably long, so I wasn't getting clues to the problem, and just gave up.

Consequently I want to rephrase the first sentence of your third paragraph. The problem with DCGs is performance: they're exponential in the naive case. That might not sound too bad today, but back in 1985 the hype was that Prolog let you program declaratively. Just declare what a parse looks like. Reasonable performance in the naive case was the popular selling point for Prolog.

thurston · on May 18, 2011

Where is the grammar-dependent scanning?

Note that threading the context through the parse tree while maintaining fully generalized parsing requires keeping all versions of the parsing context in memory. Consider making a C++ parser in that way ... ie every time you modify the structures you build in memory you make a copy of them first.

Hemospectrum · on May 18, 2011

If you know that every subfield of your state object is immutable, and you're using references instead of "local" copies (inconvenient unless you have garbage collection) then your copying costs are limited to the size of the overall state object plus whichever field changed.

Obviously this makes no sense for C++, but Clojure and OCaml get away with it and in Haskell it's the standard way of implementing almost every stateful computation.

swannodette · on May 18, 2011

If your data structures are persistent data structures you don't incur the costs of copying.

thurston · on May 18, 2011

Indeed I did not follow what you meant. I missed the "persistent" and saw ... "don't incur the cost of copying." I took that to mean that you were referring to not maintaining any history and using a single, mutable global state. It is with that understanding that I made the previous comment.

To address what you meant ... it's more a matter of what's practical. For example, to parse C++ one needs to do quite a bit of semantic analysis during parsing. A full implementation of the type system is necessary. Then the data structures change with every statement/clause parsed. Achieving this with persistent data structures is far too costly.

thurston · on May 18, 2011

Then you forgo generalized parsing.

Edit: indeed I did not follow what you meant.

swannodette · on May 18, 2011

I don't follow.

thurston · on May 18, 2011

To have both generalized parsing and context dependent parsing you need to somehow revert changes to the global state when you backtrack. You can do this by restoring it to old versions, which requires copying the state and keeping the history. The Colm approach is to keep only one global state, but store instructions for programmatically reverting the state while you backtrack.

jules · on May 18, 2011

The point of functional data structures is that if you modify, you don't copy the whole structure, you only copy the path to the things that changed.

A simple example would be parsing a sequence of characters and storing that in a linked list. When you get an additional character you don't destructively modify the list, instead you create a new list node that points to the existing list.

    def process(char, state):
      return Node(char, state)

The process function takes a character and a state, and returns a new state. The old state is still usable, as nothing was mutated. On the other hand no large data structure had to be copied.

Your method is probably more efficient. Because you don't need access to all historical versions simultaneously, you only have to keep a stack of undo operations. So you only have O(1) slowdown per operation. With functional data structures you have in the worst case a O(log n) slowdown per operation.

For example if you implement a bit vector, then update and lookup can both be O(log n), e.g. a red-black tree. If you want update to be O(1) then lookup becomes O(n), e.g. a linked list. If you want lookup to be O(1) then update becomes O(n), e.g. copying the entire vector on update.

(please correct me if I'm wrong and if you can do better than O(n) for those cases)

thurston · on May 18, 2011

Yes I understand these techniques, in fact they are heavily in use in Colm, just not for maintaining the global state that is used for the parsing feedback loop.

I use the term "copy" in the general sense to make discussion easier. Conceptually, persistent data structures are a copy, even if they are optimized heavily to incur very little cost.

swannodette · on May 18, 2011

Then you misunderstood my point, backtracking and keeping history with persistent data structures is very cheap.

jws · on May 18, 2011

Notice that the DNS example is parsing a binary DNS request, not a text file.

thurston · on May 18, 2011

:) If I had my way this comment would be closer to the top. Not many grammar-based parsing systems can claim raw DNS parsing.

haberman · on May 18, 2011

From my quick scan of the thesis, the basic design seems to be a programming language in which you write both the parser and any transformations you want to perform. It's not clear whether there is an easily-accessible parse tree serialization that you can use to load the output into another language, or whether you'd have to invent that yourself.

I think it's generally a hard sell if you try to convince people that they need to write their algorithms in your special language. Parsing tools deliver value because grammars are easier to write than the imperative code that implements those grammars. That value offsets the cost of having to learn a new special-purpose language. But imperative programming languages are already pretty good at tree traversal and transformation, so there's little benefit to using a special-purpose language for this.

I think that the next big thing in parsing will be a runtime that easily integrates into other languages so that the parsing framework can handle only the parsing and all of the tree traversal and transformation can be performed using whatever language the programmer was already using. This requires much less buy-in to a special-purpose language.

thurston · on May 18, 2011

Colm has built-in serialization. There is still some work to do in this area though. Colm will preserve whitespace for minimal disruption of untransformed text, but figuring out what to do at the boundaries between modified and unmodified trees can be tricky.

You are right, people want to use general purpose languages for the more complex algorithms. I agree a means of embedding is necessary and I have kept this in mind, though not yet achieved it. I would very much like to be able to parse, transform, then have the option to import the data into another environment and carry on there.

haberman · on May 18, 2011

Thanks for the info. What is the built-in serialization format?

thurston · on May 18, 2011

Just plain old text as it came in. I see now that is not what you were referring to. You're talking about JSON, XML, etc I now think.

There is also a print_xml function, which puts the tree into XML, but it's mostly used for debugging at this point, not export to other systems. I'm hoping that with time these kinds of features will crop up.

beza1e1 · on May 18, 2011

AntLR can do this, although it does not work that well. I used the C backend, which is pretty directly ported from the Java backend. C-in-Java-style is pretty awkward.

haberman · on May 18, 2011

ANTLR is not the same as what I am describing. ANTLR generates code in each target language: I am talking about a common runtime that all languages call into. Think of it as a "parsing VM." Using this scheme, there would be no need to have separate backends for C and Java, the only thing you'd need to port is the bindings.

beza1e1 · on May 20, 2011

Why would that be a good idea? What advantage would a just-in-time parser generator have that a static parser generator does not?

haberman · on May 21, 2011

Fast parsing from any language (even slow languages like Ruby) without having to compile and link generated C code for every grammar into the interpreter.

With this approach, you could have a C extension that could load any grammar at runtime and parse it extremely fast.

C/Assembly code is orders of magnitude faster at parsing than generating eg. Ruby that does the parsing.

danellis · on May 18, 2011

What you're describing sounds like the Gold parser: http://www.devincook.com/goldparser/

haberman · on May 18, 2011

Yes, or my own project, Gazelle: http://www.gazelle-parser.org/

bdfh42 · on May 18, 2011

Quote "Colm does not yet have any documentation".

Then I would hazard that it is not yet a language as without documentation it has no "grammar". At best it is a patois.

thurston · on May 18, 2011

Grammar: http://svn.complang.org/colm/trunk/colm/lmparse.kl

wzdd · on May 18, 2011

TXL, its apparently predecessor, is very well documented (http://www.txl.ca/). TXL is a very interesting approach to parsing and worth reading up on if you're interested in the area (or are waiting for documentation for Colm :)

scscsc · on May 18, 2011

There seems to be a PhD thesis behind, so you should check it for the grammar.

colomon · on May 18, 2011

It would be interesting to see someone who understood both this and Perl 6's grammars to do a comparison. Based on Colm's quick description and my rough understanding of Perl 6 grammars, they sound like they are roughly equally powerful. But I admit I'm not sure I understand what "transformation language" means...

audreyt · on May 18, 2011

Although similar in expressive power, Colm offers instruction logging to auto-reverse global state changes upon backtracking, something Perl 6 grammars does not (yet) support; at the moment we need to manually manage them with embedded blocks.

chocolateboy · on May 19, 2011

Re: "reverse global state changes upon backtracking": this sounds similar to the (manual) "undo actions" supported by the Kelbt parser [1], perhaps unsurprisingly as it was developed by the same author :-)

[1] http://www.complang.org/kelbt/

colomon · on May 18, 2011

Thanks!

thurston · on May 18, 2011

If they are then I don't deserve to be called "Dr. Thurston!"

on May 18, 2011

[deleted]

audreyt · on May 18, 2011

Interestingly, I parsed thurston's response not as an appeal to authority, but as self-mocking humor (i.e. "wow, if Perl 6 does that too, maybe my PhD was unwarranted!"). I guess English parsing is non-deterministic too... :-)

anigbrowl · on May 18, 2011

Good grief. Have a sense of humor; the grandparent response was obviously made with tongue firmly in cheek.

Twisol · on May 19, 2011

Adrian Thurston (the creator of Colm) is also responsible for the fantastic Ragel state machine generator.

DrCatbox · on May 18, 2011

I am more interested in DSNP, how come this project has not received more fame than the infamous Disapora? http://www.complang.org/dsnp/

thurston · on May 18, 2011

There are some difficult problems in that space. I've posted to HN and reddit a few times, but mostly I've been working on it quietly so I can focus. Lately, that's starting to change. I'll be talking about it at FSW 11 in Berlin in a few weeks.

DrCatbox · on May 18, 2011

My google sense failed me this time around to find information on this FSW 11 in Berlin. Care to explain? Is it a conference, can anybody come?

I am really interested in DSNP and am fairly well versed in GNU/Linux and can do some programming, Java and Python mostly. I work as a web-frontend developer guy. Can I be of some help? Do you need testers, peers, documenters?

thurston · on May 18, 2011

Ya it's currently hard to find. http://d-cent.org/fsw2011/

I need help from people like you actually. What I've done.

1. defined the protocol

2. implmented it in a C++ daemon that

   a) talks to other daemons

   b) serves the content managers (frontend UIs)

3. written a (crappy) example content manager.

What needs to happen next is step 3 needs to be repeated by other people who know what they are doing. They don't need to understand the details of the protocol, they just need to understand the basic model, which is just message broadcast, distributed agreement, etc.

Email me for more details, will get back to you later tonight.

Barrasmara · on May 18, 2011

This kind of sounds like Semantic Design's DMS software Reengineering toolkit and the Parlanse language.

thurston · on May 18, 2011

They are related systems. DMS is much more mature.