Langcc: A Next-Generation Compiler Compiler

armchairhacker · on Sept 24, 2022

The auto-generated AST data structures are really the selling point here. Having to manually write an AST for your language and update it if you update your language is mechanical and tedious.

I suggest the developer also look into auto-generated visitor / traversal functions as well, as those are also very mechanical and tedious. Another suggestion is to make sure the grammar preserves source information (file, line, column span) and implement re-printing which preserves comments and whitespace in unmodified sections of the AST. I get those are a lot of work though, as-is this is very impressive.

jzimmerman64 · on Sept 24, 2022

> I suggest the developer also look into auto-generated visitor / traversal functions as well

In fact, langcc already has visitor/traversal functions, though they are not so well documented yet. I have just added an example (expr_simpl in https://github.com/jzimmerman/langcc/blob/main/examples/calc...).

> Another suggestion is to make sure the grammar preserves source information (file, line, column span)

This is already present as well (the calc example illustrates this functionality).

> implement re-printing which preserves comments and whitespace

I considered doing this, but it's surprisingly difficult to get the specification right, especially in whitespace-sensitive languages like Python.

HarHarVeryFunny · on Sept 24, 2022

> The auto-generated AST data structures are really the selling point here

But sometimes you want something custom ...

A fun project I did a while back was implementing a C subset (no structs, pointers or goto) using a directly-executable AST! Kind of a hybrid approach between a compiler and interpreter. This was built as a C++ library to load and execute C-like macros (including a domain-specific library of built-in functions).

What I did (using C++) was have two base classes, Statement and Expression, with respective virtual functions Execute() and Value(). Each type of C statement and expression then derived a corresponding "AST" class from these.

e.g. class IfStatement derived from Statement and implemented the Execute() method. The IfStatement constructor (following the grammar) took an Expression and two Statement base-class pointers as arguments which it stored. The implementation of IfStatement::Execute() was then just:

if (expression->Value() if_statement->Execute(); else else_statement->Execute();

Similar for ForStatement, etc, etc.

The root of the AST was then a CompoundStatement representing the entire program as a list of Statements, and you would then execute the program just by calling the Execute() method on this statement.

In this case I used a recursive descent parser (part of the library I built) to build this "AST", but of course you could do the same for a table driven parser, and it would make a nice demo of how little code you need to write to not only parse but also execute/evaluate whatever you are parsing.

layer8 · on Sept 24, 2022

I don’t find this to be very impressive. I used SableCC [0] 15 years ago, which generates Java source code (classes) for the AST along with the parser, including Visitor traversal. It was missing line number information and character data between tokens, but that was straightforward to add by modifying the built-in templates used to generate the AST classes. I was also able to add generically-typed return values to the Visitor mechanism just by editing the built-in templates.

[0] https://sablecc.org/

mhh__ · on Sept 24, 2022

Auto generating stuff also forces you to not bolt tons of shit into all the last nodes

benjaminjosephw · on Sept 23, 2022

> langcc is general enough that the tool is self-hosting: that is, one can express the "language of languages" in the "language of languages" itself

There's something deeply satisfying about recursive tools who's inputs can include the definition of the tool itself. Brilliant.

cakoose · on Sept 24, 2022

That claim is sort misleading. The complexity of the specification's own grammar is largely unrelated to it's expressiveness.

I'd expect most parser generators in this category to be able to parse their own grammars.

Imagine a very weak parser that can only handle LR0. But if it uses a Lisp-like grammar language, it too is self-hosting.

kazinator · on Sept 24, 2022

> The complexity of the specification's own grammar is largely unrelated to it's expressiveness.

E.g.:

  bitstring : bistring '1'
            | bitstring '0'
            | /* empty */
            ;

this expresses all that can be expressed. The rest is semantics.

jzimmerman64 · on Sept 24, 2022

The language of BNF for langcc (https://github.com/jzimmerman/langcc/blob/main/grammars/meta...) provides many syntactic conveniences that are not present in a Lisp-like language, so the fact that langcc supports it is a nontrivial achievement. In particular, an LR(0) parser would not be anywhere near adequate for it.

cakoose · on Sept 25, 2022

I wasn't saying langcc was not powerful, just that the shape of the argument doesn't make sense.

That part of the README goes something like this:

1. It can parse Python and Go efficiently.

2. In fact it's so expressive that it can even parse itself, which is a "language of languages".

If you had first shown some hard-to-parse langcc syntax, then sure, _that_ would be evidence of expressiveness. But there's nothing impressive about being able to parse a "language of languages", since a language of languages can be LR(0).

kragen · on Sept 23, 2022

The OG of recursive parser generators whose inputs can include the definition of the parser generator itself is D. Val Schorre's META-II, which compiles itself (or other compilers written in the same grammar language) to an assembly language for a parsing-oriented abstract machine. Thanks to the ACM's enlightened policies, the META-II paper is now freely available (though not yet, as far as I can tell, open access): https://dl.acm.org/doi/10.1145/800257.808896

I wrote an improved version called Meta5ix http://canonical.org/~kragen/sw/dev3/meta5ix.m5 (an interpreter for the virtual machine, with a complete description, is at http://canonical.org/~kragen/sw/dev3/meta5ixrun.py, with the precompiled grammar at http://canonical.org/~kragen/sw/dev3/meta5ix.generated.m5asm) and adapted it to produce C: http://canonical.org/~kragen/sw/dev3/meta5ix2c.m5 (precompiled version at http://canonical.org/~kragen/sw/dev3/meta5ix2c.c).

Meta5ix written in itself is only 18 lines of code. Briefly, "" encloses expected input, {} encloses a line of output, [] encloses repeating constructs, commas separate alternatives (implemented without backtracking), fnord skips leading whitespace, $it copies the last <<>>-enclosed input token to the output, and other $variables interpolate locally-defined assembly labels.

    - program: ["-" name @it ":" terms {return}, "#" [:notin ""]]
    - terms: term ["," {continue $choice} term] @choice
    - term: (factor {else $seq}, output) [factor {assert}, output] @seq
    - factor: string {literal $it}
           , "(" terms ")"
           , "[" @many terms {continue $many} "]"
           , name {call $it}
           , "<<" {begin} terms ">>" {end}
           , ":fnord" {fnord} factor
           , ":notin" string {notin $it}
           , ":between" string {between $it}
    - output: (quasiquote, "@" {dedent} (var, quasiquote)) {writeline}
    - quasiquote: "{" [(<<ch [ch]>>, "\" <<:notin "">>) {say "$it"}, "$" var] "}"
    - ch: :notin "$}\"
    - var: "it" {copyinput}, name {gen $it}
    - string: :fnord <<'"' [:notin '"'] '"', "'" [:notin "'"] "'">>
    - name: :fnord <<letter [letter, :between "09"]>>
    - letter: :between "az", :between "AZ", :between "__"

Meta5ix, like my earlier peg-bootstrap https://github.com/kragen/peg-bootstrap/blob/master/peg.md (66 lines of code, compiles to JavaScript, supports backtracking), is not really something you want to write a parser for your application language in. It's a compiler-compiler you can extend to support the parsing constructs you actually want for your language, then recompile with itself. Dave Long described META-II as "a field-improvised lever", and I think that's true of these as well, but maybe an even better analogy is Archimedes's fixed point.

tempodox · on Sept 24, 2022

What does “OG” stand for?

eesmith · on Sept 24, 2022

https://www.urbandictionary.com/define.php?term=OG

> OG used to mean Original Gangster allthough some poeple these days use OG as a quicker way of saying Original

kragen · on Sept 24, 2022

Yes, I meant that META-II was the Original Gangster of self-compiling compiler compilers.

XorNot · on Sept 24, 2022

I kind of hate systems like this though. What I want is easily bootstrappable compilers: how hard is it to get the system back from nothing but raw x86_64 machine code and some type of data storage.

npalli · on Sept 24, 2022

The linked paper is impressive on the golang and python implementations, possibly related to having a subset of features.

“ The tool handles both languages automatically, and the generated code, when run on standard codebases, is 1.2x faster than the corresponding hand-written parser for Golang, and 4.3x faster than the CPython parser, respectively.”

jzimmerman64 · on Sept 24, 2022

As far as we are aware, the grammars included with langcc (go.lang and py.lang) parse the full industrial programming languages Golang 1.17.8 and Python 3.9.12. They are not restricted subsets (even though cc.lang is a restricted subset of C++).

npalli · on Sept 25, 2022

Thank you, that's impressive. Good work.

lalaithion · on Sept 24, 2022

Some questions it would be great to feature in the README:

- How are the error messages? - Does it produce source spans for intermediate AST nodes? - Can it handle operator precedence without encoding the precedence into the grammar? - Can it handle source-code-defined operator precedence?

jzimmerman64 · on Sept 24, 2022

The README is a work in progress, but some of these answers can be found in the documentation (https://arxiv.org/abs/2209.08385). In particular:

> How are the error messages?

langcc produces "confusing input pairs" for LR conflicts, which enable the user to pinpoint the cause of the error much more easily than the opaque shift/reduce errors in tools such as yacc/bison.

> Does it produce source spans for intermediate AST nodes?

Yes! You can find an example in https://github.com/jzimmerman/langcc/blob/main/examples/calc....

> Can it handle operator precedence without encoding the precedence into the grammar?

Yes - detailed in the documentation.

> Can it handle source-code-defined operator precedence?

Depends on what you mean by source-code-defined. The operations supported by langcc's precedence stanzas (detailed in the documentation) are quite comprehensive, and suffice to encode the operator precedence rules in Golang 1.17.8, for example.

lalaithion · on Sept 24, 2022

Thanks! In general, I would avoid putting the documentation for a programming tool in a pdf hosted on Arxiv. It didn’t even occur to me that either of those links would be useful documentation. Additionally, pdf form isn’t the most helpful way to present documentation either.

mdaniel · on Sept 25, 2022

See if you find this easier to read: https://github.com/mdaniel/langcc/blob/documentation/2209.08...

duped · on Sept 23, 2022

Does it generate a batch or reactive compiler and allow for custom error recovery/messaging?

It's always cool to see new tools but at a glance this doesn't seem to solve the problems that necessitate a hand written parser.

sanxiyn · on Sept 23, 2022

Just about all parser generators allow for custom error so this one does too. "Hand written parsers have better errors" is just an internet meme, not the actual state of the technology.

duped · on Sept 24, 2022

...so users write specific rules in their grammars for error cases and then add custom error messages?

The actual state of the technology are tools like tree sitter, which doesn't generate an abstract syntax tree like this one does.

I've written many parsers in my life. They're almost exclusively recursive descent because a generated parser is inferior in many, many ways. A lot of parser generators miss the forest for the trees - lexers are usually trivial to write and a hand rolled recursive descent parser takes about a day (and lets you handle precedence in a sane way that can be extensible!) - usually for better performance, easier maintainability, and good error handling.

What's super hard to do is an incremental query oriented parser and compiler with a concrete syntax tree that doesn't use ungodly amounts of memory and has bad performance. Generative tools for that architecture are few and far between but much more useful than lex/yacc alternative.

bsder · on Sept 24, 2022

Is this really a meme? If so, I'd like a counterexample of a compiler that people regard as having actively good error messages that uses a parser generator.

It seems like all the compilers I know of got much better errors when they switched to a fully custom generated intermediate language rather than one from a parser generator.

rurban · on Sept 23, 2022

Looks good, but I'm sceptical that I can specify perl5 or raku with it, as this would need dynamic flex rules. python, php or ruby should be fine though.

dontlaugh · on Sept 23, 2022

Perl 5 is famously impossible to parse without executing, so no parser will be able to handle it.

rurban · on Sept 24, 2022

That's not the point. flex is dynamic and can change the AST. langcc should also. perl5 uses a normal flex/bison parser setup.

KerrAvon · on Sept 23, 2022

It includes a grammar for Python but says it has some limitations, so maybe, sort of. Amusingly, the included C grammar is shorter than the Python grammar.

Ruby has some syntactical edge cases that make parsing a bit tricky.

defen · on Sept 24, 2022

The C grammar is also not complete, so it doesn't necessarily mean anything that it's shorter than the Python one.

jzimmerman64 · on Sept 24, 2022

Correct. The C/C++ grammar is not complete; it only encodes the subset of the language that was needed to bootstrap langcc itself.

As far as I am aware, however, the Python grammar parses the full industrial language Python 3.9.12.

defen · on Sept 24, 2022

Do you think it would be possible to use this tool to create a full C frontend (including preprocessor)? I imagine the difficult part would be passing preprocessed text (including generated source e.g. pasted tokens, and expansion locations) into the C parser.

jzimmerman64 · on Sept 25, 2022

The preprocessor would likely have to be a separate ad-hoc step, but it is possible that langcc could be used to generate a full C frontend if one is willing to run two passes (needed to deal with ambiguous cases such as "x * y", which can be either a multiplication expression or a variable declaration, depending on the contents of the symbol table).

jzimmerman64 · on Sept 24, 2022

Readers who are interested in parsing theory may also enjoy the companion technical report:

https://arxiv.org/pdf/2209.08383.pdf

as well as a brief users' manual, available here:

https://arxiv.org/pdf/2209.08385.pdf

HarHarVeryFunny · on Sept 24, 2022

Meh .. I wrote a full LR(1) parser generator for fun about 40 years ago (c.1982), back when it was considered impossibly impractical (the "dragon book" said so!).

Generating an LR(1) parser itself is simple enough using Knuth's algorithm - you're just building a large state machine and having to maintain sets of look-head tokens for each partially-parsed production rule in the states, which is why the resulting state machine becomes impractically large (at least for computers at that time).

What made this possible back at that time was a paper by Professor Pager from the university of Hawaii that described how you could merge states, under certain compatibility conditions, during the cannonical Knuth parser (state machine) generation algorithm, which kept the size of the state machine under control.

The parsers built by parser generators like mine, YACC, OP's Langcc, etc, are referred to as "table driven parsers" (to differentiate them from recursive descent parsers), but back in 1982 representing the state machine as an actual table wasn't very practical (too big) so rather than having a large sparse 2-D array indexed by state and token you'd normally represent each state as a list of actions instead.

However, I found I could represent the tables an sparse 2-D array, but then compress the array to make it on manageable size. This was much preferable to using lists since access was much faster - just 3 array accesses to translate raw state, token indeces into compressed array indeces, then the actual lookup.

To compress the sparse array I converted it into a graph coloring problem. I mapped the sparse array to a graph where each row of the array would map to a node in the graph, and any two nodes would be connected only if the corresponding sparse rows had a conflicting entry in any column. I then used a simple graph coloring heuristic to color the graph (goal - color graph with minimum number of colors, given constraint that connected nodes can't have the same color), then converted back to a compressed array by merging rows whose corresponding graph node's had the same color (hence no conflicts). I then repeated same procedure for the columns.

The resulting parsers, when coupled with an optimized hand written tokenizer, was stunningly fast, even on the dog slow computers of the time, but sadly I never did anything with it - was just a fun side project done for my own education/amusement.

The good old days. Get off my lawn.

jzimmerman64 · on Sept 24, 2022

> back when it was considered impossibly impractical (the "dragon book" said so!)

Yes, with some optimizations (e.g., Pager's algorithm), LR(1) can be practical; in langcc we even implement LR(k), and this is also practical for many real-world grammars.

However, a key observation is that even LR(k) is not enough for many industrial language features. You may enjoy reading the companion technical report for langcc, which delves deep into the underlying parsing theory and proposes a number of new techniques:

https://arxiv.org/pdf/2209.08383.pdf

HarHarVeryFunny · on Sept 24, 2022

> However, a key observation is that even LR(k) is not enough for many industrial language features.

Yes, although sometimes a minor hack is all that's needed. For C/C++ you can maintain a symbol table while parsing, then have your "lexer" return a different token for a type-name vs other identifiers.

pabs3 · on Sept 24, 2022

This might be useful for bootstrapping languages from other languages.

https://bootstrappable.org/

ridiculous_fish · on Sept 24, 2022

Some questions I have

1. Is there support for comments beyond just the lexer, e.g. attaching them to nodes for a source beautifier?

2. Is there support for parsing incomplete input?

I got it building, should anyone be curious about the generated source I posted it here for the "calc" example:

Header file: https://gist.github.com/ridiculousfish/63449ca95d3f957f53a85...

.cpp file: https://gist.github.com/ridiculousfish/32b053325b698adda782f...

the_duke · on Sept 24, 2022

There's a lot of focus on parsing and AST definition in the language space.

It's true that both of those are a bit tedious to write, but in reality they represent a tiny amount of the work that goes into writing a language.

It's nice for experimentation, but you will soon need to customise your AST, add interning, have a permissive parser with error recovery (which always depends on language semantics) , propagate errors through your AST, ...

Most people who try to implement a language never get beyond the parse code to AST stage, so this would be nice for students and beginners.

But I don't think it's impactful beyond that.

jzimmerman64 · on Sept 24, 2022

Many of these claims are subjective, but I will at least point out one fact: langcc is general enough to compile its own frontend - and not just for experimentation.

mingodad · on Sept 24, 2022

It seems too hack depending on "__attribute__((always_inline))", tcmalloc for performance, specific libunwind to dump backtraces, doesn't build with g++.

Maybe the general idea is valuable but the implementation ?

jupp0r · on Sept 23, 2022

> automatically generates a compiler front-end

For a very narrow definition of compiler frontend. The missing piece to make this super awesome would be generating LLVM IR, which is where the semantics of the language would need to be understood (vs just the grammar).

tgv · on Sept 24, 2022

Parser generator would be more apt.

jzimmerman64 · on Sept 24, 2022

It's not just a parser generator; it also generates the AST, traversals, lexer, and pretty-printer. And it can generate the AST for your IRs and target language, if you specify them in BNF as well. This is a pretty significant piece of a compiler, all told.

tgv · on Sept 24, 2022

Generating the tree is trivial. The lexer is part of the parser. A pretty printer is not part of a compiler. The rest you have to do by hand. I know the name is a play on yacc (Yet Another Compiler Compiler), but it generates only a part of a compiler, the parser.

jzimmerman64 · on Sept 24, 2022

> Generating the tree is trivial.

If by "generating the tree" you mean generating definitions for the AST data structures, this is nontrivial enough that yacc does not do it.

> The lexer is part of the parser.

Again, yacc (a parser generator) does not include a lexer generator; that requires a second package such as lex.

> A pretty printer is not part of a compiler.

A pretty-printer for the IR or target language is an important part of a compiler, as one needs to render the output in a form that some other tool (e.g., an assembler) can accept. A pretty-printer for the source language also comes in handy.

> The rest you have to do by hand.

What you have to do by hand is to write a function from the source AST to the target AST (i.e., a compiler backend). langcc generates the whole frontend, not just the parser.

Darmani · on Sept 24, 2022

I have not published any papers on parsing, but I've read a large chunk of the Handbook on Parsing, am buddies with Terrence Parr and have given a lecture on the ALL() parsing algorithm, and worked for a company that has perhaps the most extensive collection of GLR parsers around. (Whereas langcc boasts automatically-generated parsers for the easy-to-parse Python and Go, my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others.) I would not consider myself an expert on parsing in comparison to the few who research it, but I would consider myself expert by the standards of most professional compiler engineers, and certainly by the standards of the average software engineer.
When I see a library like this, I ask "What new things does this thing bring to the table?" For that, before I even get into trying to understand what it does, I want to see how well the author understands what's already been done and how it compares to prior work.
My very blunt assessment: I don't see any reason to take this library seriously.
I looked at the 40 page paper. I can tell just from the citations that it takes a very 70's few of parsing. And when it does discuss other work, it mostly gets it wrong. A few examples:

Page 8: He writes about the "ETL family, also called GLR parsing." "ETL family" seems to be a term of his own coinage, and he's writing about GLR parsing and Earley parsing as if they're the same. Now that I think about it, I see some resemblance, but they're usually considered quite different. See e.g.: https://cstheory.stackexchange.com/questions/30903/why-tomit...

* Page 9

> There has also been significant development of the top-down family of parser generators, including tools such as ANTLR [PQ95] which support a superset of LL(k) grammars, as well as their parallel counterparts, referred to as Generalized LL (GLL). For the reasons described above, we believe that while top-down parsers are theoretically interesting, the grammars they support are not sufficiently general to capture full programming languages.6 Notably, they are significantly more restrictive than LR(k), and we argue that even the latter is not sufficiently general in its standard formulation.

This has not been true since ANTLR 2, released 25 years ago. The LL() algorithm of ANTLR 3, described in a 2011 publication, is capable of parsing even some context-sensitive grammars.

Page 9:

> The tree automata of Adams and Might [AM17] are similar to our implementation of monotonic attributes; they differ in that they impose left-to-right sequencing on the attribute constraints (whereas attributes can refer to arbitrary constituents of a grammar rule), and require productions to be duplicated (whereas attributes are boolean or integer values layered independently on top of the standard context-free grammar rules)."

This is one of my favorite papers of the last 5 years, and this is totally inaccurate. The user writes down a normal grammar and syntactic regular expressions for forbidden subterms. From that, it automatically generates a new grammar that e.g.: breaks symmetries.

In short, I think the author has the knowledge to claim that his solution beats the best that the 70's and 80's has to offer. I am not motivated to look at this any further because I've been given no reason to believe it's had a fair comparison to what the 2010's has to offer. I continue to recommend ANTLR 4 for your parser-generator needs. Its ALL(*) algorithm truly is magic. tree-sitter is also good and has a lot of momentum.

Joe Zimmerman: If you're reading this, please don't take it too personally. I'm reacting to the claims of the work more than the work itself. If you'd like, I'll gladly discuss this with you. Either I'll learn something new, you'll get better at clarifying your ideas, or both.

camgunz · on Sept 24, 2022

I ran into a couple issues using ANTLR 4. The first is that it mostly required my build pipeline to have a JVM, which wasn't super convenient. The second is that the generated Python parsers are super slow, like too slow for even just iterating while developing.

I almost wish ANTLR 5 wouldn't be about any new parsing stuff, but just a refocusing of the project on implementing its compiler-generator in the popular languages. It basically is that already, but it's pretty clearly not first class.

samsquire · on Sept 24, 2022

Someone has done something awesome and good, why not start from that perspective of supporting and recognising someone else's talent.

From my understanding of your undertone of your message you are telling them they don't know what they're talking about when I don't think that's true.

There's no need to put people down.

Darmani · on Sept 24, 2022

Satirical answer: Because I'm a crusty old scientist who needs to defend his turf.

Self-serving answer: I'm doing the author a favor as perhaps the first person with my level of knowledge of the field putting effort into giving feedback on the work.

Serious answer: Because it's presenting itself as an artifact of scientific novelty rather than as an engineering accomplishment. It is making major claims not only about itself but also about the entire state-of-the-art. (You could say that the langcc papers put down the entire field of parsing.) Its contributions and scholarship therefore need to be evaluated like any other scientific work.

jzimmerman64 · on Sept 24, 2022

> Whereas langcc boasts automatically-generated parsers for the easy-to-parse Python and Go, my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others.

Python may be (comparatively) easy to parse, but Golang is not - it is _barely_ LR(1) with the extensions described in https://arxiv.org/pdf/2209.08383.pdf (notably, semantic attributes and the CPS transformation).

> my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others

GLR parsing, as I note in the paper, is very different from pure LR(k) parsing. Are the automatically-generated parsers you mention guaranteed to run in linear time? Are they more efficient than hand-written recursive-descent for the same languages?

Not to mention that full C++ parsing, as I also note in the paper, is undecidable, due to simple ambiguities such as "x * y;" (whose parse depends on whether x names a value or a type, which may depend on template computations). So it should be impossible to fully automatically generate a parser for C++ from a declarative BNF spec, without some significant ad-hoc machinery on top.

> he's writing about GLR parsing and Earley parsing as if they're the same. Now that I think about it, I see some resemblance, but they're usually considered quite different.

I'm not calling them "the same"; I'm grouping them together into the same family, as they share some particular characteristics relevant to the paper (polynomial but not necessarily linear time, left-to-right, dynamic-programming-style).

> This has not been true since ANTLR 2, released 25 years ago. The LL() algorithm of ANTLR 3, described in a 2011 publication, is capable of parsing even some context-sensitive grammars.

Again, the implicit claim here is "generated from a declarative BNF spec, efficiently, with guaranteed linear time, parsing full industrial languages...". Of course one can parse many languages with backtracking. Do you have examples of full industrial languages parsed with ANTLR 3 that satisfy the relevant criteria?

> This is one of my favorite papers of the last 5 years, and this is totally inaccurate.

Could you say what, specifically, is inaccurate about it?

> In short, I think the author has the knowledge to claim that his solution beats the best that the 70's and 80's has to offer. I am not motivated to look at this any further because I've been given no reason to believe it's had a fair comparison to what the 2010's has to offer.

Again, I would ask you to exhibit concrete results here. Show me an existing tool that is already capable of generating a parser for Golang 1.17.8, from a declarative BNF spec, with guaranteed linear-time parsing, and that matches or exceeds the hand-written parser in speed.

Darmani · on Sept 24, 2022

Hello Joe! Always good to see an author responding to criticism.

> Again, the implicit claim here is "generated from a declarative BNF spec, efficiently, with guaranteed linear time, parsing full industrial languages...". Of course one can parse many languages with backtracking. Do you have examples of full industrial languages parsed with ANTLR 3 that satisfy the relevant criteria?

I would suggest you lead with this and make the implicit claim explicit. The main interesting part of the claim is guaranteed linear time, something which is interesting and (to my knowledge) novel (although not what's important to me personally in parsing). I did not gather from my level of inspection that guaranteed linear-time was a claim of this work. What I gathered instead is the claim that it can improve on 70's LALR parser generators and that it attacks a consensus that handwritten parsers are necessary for industrial languages.

Both of these claims that I actually gathered decreased my motivation to read more about this work. In particular, I'd say the "consensus" that hand-written parsers are necessary is old hat, and, in the circles of people who are actually interested in better ways of constructing tools, the consensus is more the opposite. I can't say I've done an explicit poll of the software language engineering community, but everyone in the community who's brought this up has said something along the lines of "the days of handwritten parsers are over." The last time I heard this opinion was last week at ICFP (I believe said by Michael Adams, in fact).

I've now learned a (to me) new claim, that a key part is the guaranteed linear-time parse. This is great on its own even as a pure theory result. But I'll need to read a lot of highly-technical development to evaluate this claim, and the other claims discourage me from investing the effort.

Speaking of which, I highly encourage you to submit your work to SLE ( https://2022.splashcon.org/home/sle-2022 ), or, if you're feeling more ambitious, OOPSLA. These are the main venues I know of where parsing results are still published. (SLE in some years has had an entire parsing-focused sub-conference.) If the claims you're making are true, then this is work that needs to get the stamp-of-approval of peer review from people more expert than I, and for everyone to know about. There's an early-submission process for OOPSLA to get feedback far in advance of the conference; the deadline is in just over a month.

> I'm not calling them "the same"; I'm grouping them together into the same family, as they share some particular characteristics relevant to the paper (polynomial but not necessarily linear time, left-to-right, dynamic-programming-style).

Cool! I agree. This will be a great way to revise your description in the next draft of the paper if you're inclined to make one.

> Could you say what, specifically, is inaccurate about it?

I believe I did in my last comment. The main thing being the duplication of productions. The only interpretation of this that makes sense to me as a claim about the paper (as opposed to a claim about CFGs in general) is that it somehow requires the user to write a production or variants thereof twice, which is false. "Left-to-right sequencing on the attribute constraints" is also nonsense to me. I think it's requiring me to reinterpret the results of that paper in terms of the algorithm of XLR, which I don't like as a way to evaluate a competing work. Read on its own, I again can't think of anything in the Adams paper which involves "left-to-right sequencing" of anything.

> Again, I would ask you to exhibit concrete results here. Show me an existing tool that is already capable of generating a parser for Golang 1.17.8, from a declarative BNF spec, with guaranteed linear-time parsing, and that matches or exceeds the hand-written parser in speed.

I have to congratulate you for having this benchmark that I cannot name a work that matches it.

This raises the question of why this should be a benchmark of concern. Declarative BNF spec, sure. Guaranteed linear-time, you don't even need a benchmark to claim significance if the proof is accurate. But why Golang, and why comparison to the handwritten parser? E.g.: I'm not aware of Semantic Designs (the company with GLR parsers for Ruby, COBOL, and C++) even having a Golang parser, as they mostly deal with legacy systems. I am not prepared right now to say how significant the perf comparison is. I don't know how the result was achieved (like whether it's been achieved by well-understood perf engineering techniques that could be replicated in competitors), nor how strong the baseline is. I've done paid work in 30+ languages and studied many more, but Golang is not one of them.

I can say that I misremembered the ALL(*) paper showing spectacular performance results compared to all other parser generators. In fact, they only gave the comparison for the Java parser. So I was probably overconfident in its performance

Anyway, if you want to discuss this more, or you want an intro to a real parsing expert, I'm happy to; my contact information is in my profile. Now you have at least one datapoint about how someone versed in modern parsing is reacting to this work.

jzimmerman64 · on Sept 25, 2022

> everyone in the community who's brought this up has said something along the lines of "the days of handwritten parsers are over."

I am aware of this "consensus", but I believe it is largely wishful thinking. Prior to langcc, the community had yet to propose an alternative that is general and efficient enough (esp., linear-time) to replace hand-written recursive-descent parsers for industrial languages such as Golang. lex and yacc are not sufficient for this. It is only with the advances of the langcc paper (improved optimization procedures, CPS transformation, semantic attributes, etc.) that this has become possible.

> I believe I did in my last comment. The main thing being the duplication of productions. The only interpretation of this that makes sense to me as a claim about the paper (as opposed to a claim about CFGs in general) is that it somehow requires the user to write a production or variants thereof twice, which is false.

My comments were referring to the compilation of tree automata to CFGs, which seems to be necessary in order to use them with standard LR parsing techniques. The alternative seems to be to use the automata to filter sets of possible parses after-the-fact, which does not by itself lead to an efficient parsing algorithm.

> But why Golang, and why comparison to the handwritten parser?

Golang is a representative example of the formidable complexity, and near-ambiguity, of constructs that are commonly used in industrial languages - and which people often take as a justification that handwritten parsers are necessary. In order to be competitive with hand-written parsers (and thus support the "consensus" that hand-written parsers are no longer necessary), a generated parser should at least be guaranteed linear-time, and should be comparable in terms of concrete efficiency benchmarks.

Darmani · on Sept 25, 2022

Okay, we seem to be whittling down on the points where we disagree.

Initial disagreement:

* langcc/XLR is a major breakthrough in the ability to generate parsers for industrial languages

---- Positions:

----------- Joe: Yes

----------- Me: Unclear. Several other works in the past 20 years have similar claims. Benchmark selected by langcc is of unclear significance. langcc paper is lacking indicia of reliability to investigate its linear time claim.

Current crux:

* Linear time guarantee is a bright-line barrier between parser generators suited for tool or compiler development on industrial languages and parser generators not so suited.

------ Positions:

---------- Joe: Yes

---------- Me: No

Is this a fair summary of the discussion so far?

I can say that this is a crux for me because, if you convince me that the linear-time guarantee is this important and that langcc has it, I will be much more likely to crow about / evangelize langcc in the future.

> My comments were referring to the compilation of tree automata to CFGs, which seems to be necessary in order to use them with standard LR parsing techniques. The alternative seems to be to use the automata to filter sets of possible parses after-the-fact, which does not by itself lead to an efficient parsing algorithm.

Cool! So, the distinguishment is more about its use of CFGs as a compilation target than about the main meat of the work. I find this a much clearer description of how you intend to compare Michael Adams's paper than the one I quoted.

-----------------------------------------------------

For the rest of your comment, a few claims:

* "Golang is a representative example"

* "In order to be competitive with hand-written parser [...] a generated parser should at least be guaranteed linear-time, and should be comparable in terms of concrete efficiency benchmarks. "

* "the community had yet to propose an alternative that is general and efficient enough (esp., linear-time) to replace hand-written recursive-descent parsers for industrial languages such as Golang"

This is an interesting set of claims because all the arguments I've heard in the past in favor of handwritten parsers were about error messages, not perf. I know langcc has an idea for that as well, but you seem to be focusing on performance as the real breakthrough.

So, question:

Would you accept a generated parser for Ruby, C++, Haskell, Delphi, or Verilog that is empirically within 3x the parsing speed of the handwritten alternative as sufficient evidence that some existing work is already competitive on this front?

If not, why?

jzimmerman64 · on Sept 25, 2022

> I can say that this is a crux for me because, if you convince me that the linear-time guarantee is this important and that langcc has it

The fact that langcc has the linear-time guarantee follows from the fact that it generates shift/reduce parsers that take steps according to an LR DFA. If you are not already convinced that this property is important, I am not sure that I can convince you, except to ask whether you would be comfortable running, e.g., a worst-case-O(n^3) parser for a real industrial language in production. (Or, for that matter, running a parser such as GLR which may fail with an ambiguity at parse time. langcc parsers, of course, will report potential ambiguity as an LR conflict at _parser generation time_; once langcc produces the LR DFA, its behavior at parse time is unambiguous.)

There are also many other important theoretical innovations in the langcc paper, e.g., the LR NFA optimization procedures, the conflict tracing algorithm, and XLR.

> This is an interesting set of claims because all the arguments I've heard in the past in favor of handwritten parsers were about error messages, not perf. I know langcc has an idea for that as well, but you seem to be focusing on performance as the real breakthrough.

Yes, langcc also features a novel LR conflict tracing algorithm, which makes conflicts very easy to debug at parser generation time. I am not sure if I would call performance the "breakthrough"; more so that linear-time performance is a _prerequisite_, and the theoretical breakthroughs of the langcc paper are what allow us to satisfy that prerequisite.

> Would you accept a generated parser for Ruby, C++, Haskell, Delphi, or Verilog that is empirically within 3x the parsing speed of the handwritten alternative as sufficient evidence that some existing work is already competitive on this front?

I have not studied Ruby, Delphi, or Verilog extensively, so could not comment on those languages at the moment. I know that Golang 1.17.8 is sufficiently difficult to parse as to be a good benchmark. I am not sure whether Haskell is sufficiently difficult to parse, but would at least be _interested_ in results for it. C++ parsing is undecidable, so that would seem to rule it out as a candidate for a fully automatically generated parser.

Darmani · on Sept 26, 2022

> If you are not already convinced that this property is important, I am not sure that I can convince you, except to ask whether you would be comfortable running, e.g., a worst-case-O(n^3) parser for a real industrial language in production. (Or, for that matter, running a parser such as GLR which may fail with an ambiguity at parse time.

Looks like it stops here. Yes, I am comfortable running such algorithms in production. Github code-nav uses GLR parsing (via tree-sitter) on a large fraction of the code in existence.