Show HN: Owl, a new kind of parser generator (github.com/ianh)
156 points by panic on July 30, 2018 | 42 comments



This looks really great! For casual parsing and data extraction in particular, having a parser generator which does not restrict its users to a particular non-compositional subset of its input grammar (e.g. LL, LR, etc.) is a must, since it does not require the user to understand the particular parsing technique in order to use it. Visibly pushdown languages seem powerful enough to be able to express many practical data formats.

Shameless plug: I saw that you have based the parse tree construction on Dubé and Feeley's two-pass algorithm, which in its second phase builds a parse tree from a "journal" generated by a DFA in the first recognition phase. This works well, but as you point out, it uses memory linear in the input string for the journal.

While the parse tree must of course still be built and kept in memory, the size of the auxiliary journal can actually be limited to what is semantically required by the ambiguity of the input grammar - in many cases, the bound is O(1). We have written a couple of papers about such a streaming parsing algorithm which you might find interesting for inspiration:

[1] Optimally Streaming Greedy Regular Expression Parsing. https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf

[2] Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming Transducers. https://utr.dk/pubs/files/grathwohl2016-0-paper.pdf

We never got around to extending the technique to more powerful formalisms such as visibly pushdown languages, although I believe that should be possible.


Thanks! This is something I want to fix eventually. These papers will be very helpful.


"Understandable — like regular expressions", out of context, but I chuckled.


I get why that's funny, but I actually thought it made a good point: regular expressions in the wild can become unruly and difficult unless you break them down, but the concepts behind regular expressions are simple and natural, and easy to work with if you're writing your own very simple regexps. I don't know much about traditional parsers, but if this tool simplifies parsers the way regular expressions simplified text-based state machines, it sounds like a major step forward!


And Owl naturally breaks down grammars, similar to a CFG.

Regular expressions can also be broken down - by their recursive definition - but curiously, in practice, they never seem to be. E.g.:

    E = ab+bc
    F = x+y
    EF = (ab+bc)(x+y)
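In a host language you can recover that compositionality by building patterns from named pieces. A minimal Python sketch (illustrative only, using | for union):

    import re

    # Compose a larger regex from named fragments instead of one opaque string.
    E = r"(?:ab|bc)"   # E = ab + bc
    F = r"(?:x|y)"     # F = x + y
    EF = re.compile(E + F)

    assert EF.fullmatch("abx") is not None
    assert EF.fullmatch("bcy") is not None
    assert EF.fullmatch("ab") is None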


"Owl combines the efficiency and performance of regular expressions with the understandibility and ease-of-use of regular expressions." [not an actual quote]


"Owl uses precomputed DFAs and action tables, which can blow up in size as grammars get more complex. In the future, it would be nice to build the DFAs incrementally."

Don't know if you've read this, but it describes how you might build them incrementally:

https://swtch.com/~rsc/regexp/regexp1.html

Also, I wonder whether, instead of using DFAs, it would be easier to generate the parser using assembly-like instructions, for example:

https://swtch.com/~rsc/regexp/regexp2.html


Is Owl similar to ANTLR? Could you explain the differences between the two, and what makes each parser unique if they're similar?

I'm curious because I've quickly run across two of these types of projects now, but admittedly know very little about them.


Yeah, they both solve the same problem -- turning text into a structured "parse tree".

At the core of each parsing tool is a particular parsing algorithm. The details of this algorithm determine what kind of text the parser can parse, how efficiently it can parse it, and what kind of error messages you get if something goes wrong.

There's a huge variety of these algorithms, but nearly all of them are variations on two basic types: "LL" and "LR" parsers. For example, ANTLR uses an "LL(*)" parsing algorithm. This is a great blog post about the difference between the two basic types: http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demyst....

Owl is unique because its algorithm isn't a variation of either of these. It works more like a regular expression parsing algorithm. This makes it more limited than a tool like ANTLR, but it can solve some problems that have been difficult with the existing algorithms.

For example, most programming languages allow expressions like "1 - 2" to be used as statements. If you don't use a semicolon or anything to separate the statements, there's an ambiguity -- is "1 - 2" a single expression, or is it the two expressions "1" and "-2" in sequence?

ANTLR (at least with the default settings I tried) doesn't tell you about this ambiguity at all. It just picks one of the two options ("1 - 2" in this case). This may or may not have been what you were expecting! The problems with ambiguity checking are quite deep (this followup to the earlier blog post I linked goes into more detail: http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why...). Owl's parsing algorithm avoids these problems; it can show you the ambiguity in all cases.
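To make that concrete, here's the shape of the grammar in question (illustrative EBNF, not Owl syntax):

    program = stmt*
    stmt    = expr
    expr    = '-'? number (('-' | '+') number)*

Under this grammar, the input "1 - 2" has two derivations: one stmt containing the whole expression, or two stmts, "1" followed by "-2".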


I often stumble upon discussions mentioning the terms LL and LR and other parsing-related concepts[0]. Without a proper CS background, it's quite hard to follow along. That first link[1] and the wikipedia page[2] mentioned there are really great. Many thanks for posting those. They really shed some light on those terms.

[0]: https://jeffreykegler.github.io/personal/timeline_v3

[1]: http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demyst...

[2]: http://en.wikipedia.org/wiki/Tree_traversal


This is really amazing work!

I think the only thing it needs now is to have https://ianh.github.io/owl/try/ wrapped up into an Electron app that can load/save Owl files and generate header files for you right on your filesystem. That would be awesome! Hmm, I think I see a project in my near future (eyes:winking nose:pointy mouth:open-smile)


Owl is much more limited. That's not a problem, because the documentation states outright what it can and can't do and what it's for; experts can get things done better when we have specialised tools for different tasks.

ANTLR claims to be a general parser¹ but it falls short of that promise. People should stop using it; the software is popular only due to its relentless marketing, not because it's actually any good. It does not return all possible results for ambiguous grammars. It fails dozens of torture tests, including any grammar that's left-recursive; those cause the software to abort at compile time. You simply can't implement a huge number of language descriptions with it. Instead of simply admitting the algorithm's flaws, the maintainer would rather misdirect in bug reports. I find this disgusting.

If you want a general parser that does not disappoint, get Marpa/libmarpa.

¹ https://github.com/antlr/antlr4/blob/4.6/doc/getting-started... "ANTLR v4 takes whatever you give it--it just doesn't give a crap!"


Well, as the author of ANTLR, I will disagree with your assessment of the tool, as you can imagine. It never claimed to be a general tool. Specifically, it cannot handle indirect left recursion; hence, it is not a general context-free grammar parser. It is, however, the sweet spot between generality and speed. That quote was me having a bit of fun, using a tiny bit of hyperbole. If you take a look at our academic paper, you will see that the general parsers are like walking through a minefield. One never knows when they will hit a landmine and the parser takes overnight to parse a file. It took me 30 years to find the sweet spot of power and speed (with the help of Sam Harwell). I welcome the introduction of new tools, but I'm not sure your assessment is accurate, nor is your understanding of the parsing landscape.

Also as the paper says, "GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are often intentionally ambiguous. For computer languages, ambiguity is almost always an error." When was the last time you wanted to parse a language where the same syntax meant two different things? C++ and a few other cases, sure, but most languages are specifically designed to be unambiguous.

You are welcome to use a tool that handles all context free grammars, but the speed and ambiguity issues are not something I care to deal with.

You also mischaracterize ANTLR's handling of left recursion. It handles immediate left recursion automatically through a rewrite, such as for expressions. It does not handle indirect left recursion. You may not have seen ANTLR 4, and are basing your assessment on ANTLR 3? There is no aborting at compile time for left recursion, except in the unusual case of indirect left recursion.

http://www.antlr.org/papers/allstar-techreport.pdf


> That quote was […] using a tiny bit of hyperbole.

Then you certainly won't have a problem changing the wording to make it true, right? The Owl author explains the limitations of his software up front; why don't you?

> It never claimed to be a general tool.

The documentation does give that impression.

> If you take a look at our academic paper

Consider that HN is populated by craftsmen and businessmen, not scientists. We judge the tools on how well they work in practical terms.

> One never knows when they will hit a landmine and the parser takes overnight to parse a file.

No wonder: I see it's O(n⁴) for ALL(*)/ANTLR. You need to keep up with the current state of the art, which is O(n³).

> It took me 30 years

Time and effort do not count for anything; the quality of the end result does.

> I'm not sure […] your understanding of the parsing landscape [is accurate]

I can see the practical results of the various software I tried. Does the software get the task done? Yes? Fine, that's a candidate for further consideration. No? Fine, I can eliminate it. The documentation does not say…? Now one has to come up with experiments to find out the flaws and limitations, multiplied by every single user. That's a waste of people's time.

> When was the last time you wanted to parse a language where the same syntax meant two different things? […] ambiguity is almost always an error.

That was in 2014, and I did not design that language. It does contain ambiguities; it can't be helped. No one gains anything by simply wishing it weren't so; there was a task to be done. There exists software that can cope; you should take that as an incentive to improve yours.

> but the speed and ambiguity issues are not something I care to deal with.

… or don't improve. But why would you refuse to? Is not your reputation at stake?

> You also mischaracterize ANTLR's handling of left recursion. […] It does not handle indirect left recursion.

That's what I meant. I was incorrect when I wrote "any grammar that's left-recursive".


It always amazes me that so many "famous" people still take time to read HN and respond to questions (even insulting ones)! I used Antlr4 in the past and have to say it's a well-designed system, though at the time (2016) the Python runtime, which I was most interested in, was not very stable.

I've implemented several parser generators myself (nothing as polished as Antlr) and worked on a "rapid prototyping" parser generator tool a while ago (https://github.com/adewes/parsejoy). The idea was a tool that could create parsers on the fly without requiring compilation of any code, and let the user control and modify parsing logic using Lua if necessary. I started writing it in Go but switched to C++ later in order to get more predictable performance.

In the tool I implemented several parsing strategies, notably "parsing expression grammars" (PEGs) as well as a GLR parser (based on the Elkhound paper). And while I find GLR parsing powerful, it's not without problems; e.g. errors are hard to debug. That said, it's pretty amazing that you can throw almost any grammar at a GLR parser and it will (usually) be able to parse it. As you said, though, writing a "bad" grammar can yield exponential time complexity on some input strings.

PEGs are also nice in general but require you to put more work into the grammar, as you are basically required to resolve ambiguities yourself using the "and" or "not" operators. Also, a naive implementation of a PEG parser will have horrid performance, as it basically evaluates all possible alternatives via backtracking until it hits a valid one; for a real-world language (I've implemented e.g. a Python parser for testing) this will ruin your performance due to the large nesting depth of the rules (Python has around 8-12 levels of rules, I think). Using a prefix tree can alleviate this, as can packrat parsing, though the latter comes with its own cost: memoizing parse results requires memory allocation.
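To illustrate the packrat idea, here's a minimal hand-rolled Python sketch (a toy three-level grammar standing in for a real language; lru_cache acts as the memo table, trading the memory mentioned above for linear-time backtracking):

    from functools import lru_cache

    TEXT = "1+2*3"

    # Toy PEG:  expr <- term ('+' term)*   term <- factor ('*' factor)*
    #           factor <- [0-9]
    # Each rule takes a position and returns the position after a match,
    # or None on failure.  Memoizing on position means each (rule, pos)
    # pair is evaluated at most once -- the packrat property.

    @lru_cache(maxsize=None)
    def factor(pos):
        return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

    @lru_cache(maxsize=None)
    def term(pos):
        pos = factor(pos)
        while pos is not None and pos < len(TEXT) and TEXT[pos] == "*":
            nxt = factor(pos + 1)
            if nxt is None:
                break
            pos = nxt
        return pos

    @lru_cache(maxsize=None)
    def expr(pos):
        pos = term(pos)
        while pos is not None and pos < len(TEXT) and TEXT[pos] == "+":
            nxt = term(pos + 1)
            if nxt is None:
                break
            pos = nxt
        return pos

    assert expr(0) == len(TEXT)   # the whole input parses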

Anyway, in terms of popularity and number of actually implemented grammars, Antlr4 is probably by far the most relevant open-source parser generator out there. Thanks for writing it :)

One suggestion I have is to consider improving the tooling for tokenization, as it's often a problem that can be quite challenging in itself. For example, tokenizing Python code is not easy, since the language isn't context-free (due to the indentation that indicates nesting) and there are several kinds of newlines to distinguish (those that occur inside bracketed expressions vs. those that occur outside of them).
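The usual workaround is an indentation-tracking pass in front of the parser that emits synthetic INDENT/DEDENT tokens. A rough Python sketch (ignoring tabs, line continuations, and the bracket-aware newline handling mentioned above):

    def tokenize_indents(lines):
        # Yield synthetic INDENT/DEDENT tokens around each line's content.
        stack = [0]                      # indentation levels currently open
        for line in lines:
            if not line.strip():
                continue                 # blank lines don't change nesting
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                yield ("INDENT", width)
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", width)
            yield ("LINE", line.strip())
        while stack[-1] > 0:             # close anything still open at EOF
            stack.pop()
            yield ("DEDENT", 0)

    src = ["if x:", "    y = 1", "z = 2"]
    assert list(tokenize_indents(src)) == [
        ("LINE", "if x:"),
        ("INDENT", 4), ("LINE", "y = 1"),
        ("DEDENT", 0), ("LINE", "z = 2"),
    ]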


Howdy. Agreed. It'd be nice to have a simpler "match x if NOT followed by y", and something to handle context-sensitive lexical stuff like Python. I often just send all chars to the parser and do scannerless parsing. :)


Looks interesting. Can you provide some intuition on which languages can be parsed by this algorithm, and which can't?

Btw, why did you choose the name Owl?


> Can you provide some intuition on which languages can be parsed by this algorithm, and which can't?

The restriction is that any recursion in the grammar has to be inside explicit begin and end tokens. For example, you can't write a rule like

    stmt = 'if' expr 'then' stmt* | expr
because `stmt` refers to itself directly, but

    stmt = [ 'if' expr 'then' stmt* 'end' ] | expr
is OK -- the [ and ] symbols indicate explicit recursion using 'if' and 'end' as the begin and end tokens.[1]

> Btw, why did you choose the name Owl?

I wanted a bird name, and Owl was short and easy to type!

. . .

[1] Though in this case, the language is still visibly pushdown, since you can expand it manually to:

    stmt = (('if' expr 'then')+ expr*)+ | expr
This is because the grammar only has right recursion. Middle recursion (like you get if you add an 'else' clause) can't be expanded like this, so explicit recursion is necessary to parse languages which use it.

To automate this expansion process (and re-association into a sensible parse tree), Owl also has syntax for operators with precedence. Here's an example: https://ianh.github.io/owl/try/#expr


This seems similar to grammars described by DTDs (the predecessor to XML Schema), in that they're regular expressions that must be deterministic and can recurse, but only when isolated by a tag. That is, recursion always results in nested elements, never a sequence of iterated elements:

    yes:  S -> <a> S </a> | e
    no:   S -> <a></a> S  | e
I've seen these called "tree grammars". Perhaps it's identical to "visibly pushdown grammars"? I skimmed the wiki link (from your readme.md), but it requires further study. I think it would be great if your readme.md included a brief explanation, along the lines of your comment here.


I think they're the same thing, though IIRC tree grammars are usually defined to operate on actual trees rather than strings of text.


I played around with tree grammars. I think there's a formal-language reason for them being incapable of parsing expressions properly. I tried anyway, to find a subset or workaround that would work well enough, but failed. (IIRC) At best, I needed parentheses around one of the operator levels, e.g. always around +, like (a+b)c or (a+bc).

I'm interested that you have this too, but it seems to have the same problems I found: https://news.ycombinator.com/item?id=17650770

It would be very exciting if you have found a way around this!


Yeah, this is a fundamental problem. Luckily, arithmetic expressions are always left- or right-recursive, never middle-recursive. That means they're recognizable by a regular grammar, even if the parse tree doesn't match what you'd expect:

    expr = number (('*' | '+') number)*
You can replace number by (number | [ '(' expr ')' ]) to add support for parentheses as well.

My solution was to add special syntax which makes rules like this automatically:

    expr = number : num
     .operators infix left '*' : times
     .operators infix left '+' : plus
While the parse tree is built, the operators and operands are rearranged into the expected parse tree using https://en.wikipedia.org/wiki/Shunting-yard_algorithm. Here's an interactive example with some more operators: https://ianh.github.io/owl/try/#expr
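For the curious, the core of that rearrangement looks roughly like this (a minimal shunting-yard sketch that builds a nested tuple tree; not Owl's actual implementation, and parentheses are omitted):

    PREC = {"+": 1, "*": 2}              # all left-associative here

    def to_tree(tokens):
        operands, operators = [], []

        def reduce_top():
            op = operators.pop()
            right = operands.pop()
            left = operands.pop()
            operands.append((op, left, right))

        for tok in tokens:
            if tok in PREC:
                # resolve anything of equal or higher precedence first
                while operators and PREC[operators[-1]] >= PREC[tok]:
                    reduce_top()
                operators.append(tok)
            else:
                operands.append(tok)     # an operand token
        while operators:
            reduce_top()
        return operands[0]

    assert to_tree(["1", "+", "2", "*", "3"]) == ("+", "1", ("*", "2", "3"))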


Thanks, the example I tried before works now; I must have misinterpreted the output.

Nice: a language can be recognized even if the parse tree differs from what you'd expect.

I vaguely recall the shunting-yard algorithm... am I right in characterising your solution as a second level of parsing? It's not "pure", but from a practical point of view, it's no worse than back-references in PCRE. The practical issue is just one of UI usability.


Hi, I see that this is your tool, just wanted to say it looks amazing! Also, I did not realize how much of a difference syntax highlighting makes for this kind of thing until I clicked through to your Try Owl link, and, wow! Have you used this in anything? Do you know of anyone else using it yet? It looks like you've only been working on this for half a year, so it's still young, but it looks thoroughly built and well thought out. Very impressive!


Thanks! The two biggest uses of Owl right now are Owl itself (see grammar.owl in the root directory) and the "egg" programming language which is included as an example: https://github.com/ianh/owl/tree/master/example/egg-lang. I'm hoping to use it in some future projects as well!


This is actually a full fledged programming language? With two syntaxes (s-expressions and non)?? You are very serious about this! I am astounded.


Handling arithmetic expressions in a simple parser is difficult. The approach here seems limited in the usual way, e.g. the following is parsed as two statements (separated by the +):

    print 2*3 + 4


Did you want a bird name because of "nested" words?


Hey, does anyone know how this compares to Marpa? Also, more examples of errors would be great :)


Marpa is a general parser: it can parse any context-free grammar. Owl can only parse the visibly-pushdown grammars, which are a much smaller class, even smaller than the LR(k) grammars that traditional parser-generators can parse.

Besides guaranteed O(n) parsing time, it's not clear to me what the advantages of Owl are; Marpa runs in O(n) time on any grammar that Owl accepts, and on many more (and I think it can, or could easily be modified to, warn when it can't guarantee O(n) parsing time for a particular grammar).


I know nothing about parsers or Marpa, but I can offer a link for the curious: http://jeffreykegler.github.io/Marpa-web-site/


If you need to parse a file that is more complex than what regex can handle, what does one /generally/ do?

I'm doing some analysis in R, and the data is generated by a program that more or less just spits things out.

Its values are comma-separated, but if there are 2 values on one line, they are columns 1 and 10. If the first character on a line is a comma, it is not a separator but a value. Commas inside quotations or escaped with backslashes are also values.

The list of special cases and oddities is pretty long. Right now I have a large switch statement that applies different search and replace on all sorts of conditions before feeding the CSV to read_csv.

Is this a case for owl? Can I use it within R or Python?


From what I can tell, your best bet would be to run Owl to generate the parser (it produces C code) and then build a small interface in C. From there I would just use R/Python's foreign function interface to call your C wrapper.
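On the Python side that could look something like this (a ctypes sketch; "libmyparser.so" and parse_to_json are hypothetical names for a wrapper you'd write yourself, not anything Owl provides):

    import ctypes

    # Load a shared library built around the Owl-generated C parser.
    lib = ctypes.CDLL("./libmyparser.so")
    lib.parse_to_json.argtypes = [ctypes.c_char_p]
    lib.parse_to_json.restype = ctypes.c_char_p

    def parse(text):
        return lib.parse_to_json(text.encode()).decode()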

Alternatively, for Python you could use a parser combinator library like parsec. Depending on how complicated your file can get, parsec could very likely be the better choice, as it has first-class support in Python and seems powerful enough to handle your use case.

There do seem to be a number of parser combinator libraries in R, but none of them seem as well established as parsec for Python (which is based on the much better known parsec for Haskell).

A quick look around suggests that Python will be your best bet for parsing, as it has some decent tooling. Here is an article on some of the different parser generators/parser combinators in the space:

https://tomassetti.me/parsing-in-python/


I don't see the trick I really like to use, which is, for some conflicts (mostly shift/reduce ones), deciding to resolve them on the fly at run time (this lets you do things like changing operator precedence on the fly if your language requires it).


Is it possible to parse left-recursive binary expressions with this?

Like:

    expr = expr + expr
    expr = expr.method(args)
I'm trying to write a parser using a parser combinator library, and left-recursive things are turning out to be really tricky.


Yeah, there's explicit support for expressions (left recursive and right recursive) using '.operators', which creates a group of operators at the same precedence level:

    expr =
       identifier : var
     .operators postfix
       '.' identifier '(' args ')' : method-call
     .operators infix left
       '+' : plus


One thing that helped me work around left-recursive restrictions is to stop thinking in terms of full expressions and to start thinking in terms of expressions that begin with terminals and then the suffixes that follow. This means that instead of an expr expanding to the left, it expands to the right.

Let's say you're parsing:

    1 + 5 + 7

You can choose to look at it as 7 plus the expression (1+5) or you can choose to look at it as the literal 1 with a +5 suffix that then has a +7 suffix. So in this simplified example, instead of defining:

    expr = literal | expr + expr

you'd define expr as:

    expr = literal (+ expr)*

That shift in how I thought about the language I was creating basically ended all the left-recursive problems I was having with my parser generator library.
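In code, the reshaped rule becomes a loop instead of a self-call. A minimal hand-rolled Python sketch (folding left so "+" stays left-associative):

    def parse_expr(tokens, pos=0):
        # expr = literal ('+' literal)* written as a loop, so the rule
        # never re-enters itself at the same position (no left recursion).
        node, pos = tokens[pos], pos + 1          # the leading literal
        while pos < len(tokens) and tokens[pos] == "+":
            right, pos = tokens[pos + 1], pos + 2
            node = ("+", node, right)             # fold left-associatively
        return node, pos

    tree, _ = parse_expr(["1", "+", "5", "+", "7"])
    assert tree == ("+", ("+", "1", "5"), "7")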


Have you considered Generalized Parser Combinators?

https://epsil.github.io/gll/


I already made a language called OWL ;)

https://github.com/bsurmanski/wlc


Can you parse the owl grammar with owl?






