An alternate title: "Just Shotgun Parse Your Inputs".
This isn't bad advice across the board, but it is bad advice for many applications.
This is how you end up with context-sensitive formats, or ones where the semantics of a particular particle require running the parser, or worse. It exposes a rich surface area for exploitation with adversarial inputs.
That said, the advantages cited at the front of the article are all real: error recovery and decent error messages are unsolved problems with existing parser generators, although ANTLR is pretty good at it.
But, especially if you're developing the data format rather than just implementing it, I strongly suggest using something like Instaparse, which can generate a parser directly from a grammar. Once this is solidified, it's reasonable to hand-roll something, but even there, I suggest using a nice parser combinator library, like Hammer or nom.
This provides more discipline: as long as you're coloring inside the lines, you'll end up with a parser which actually implements your spec. When you start adding functions to provide context, recover from errors, and provide meaningful feedback in the form of error messages for unexpected inputs, you'll know what you're doing: a classic recursive-descent parser can't provide the same conceptual separation between the parsing logic and the logic to support ergonomics or one-pass compiling or whatever add-ons you're putting in there.
I agree with everything you say, except your advice to use a parser combinator library, because most implement PEGs. Hammer is, of course, an exception to this, as long as you call `h_compile` to tell it to use a different backend.
Why not use PEGs? In short, they don't actually remove ambiguity from your grammar, but rather hide it in ways that are difficult to reason about. You're still dependent on code as a definition of the language that you accept, and worse, that code is very fragile in the face of refactoring. Further, the lexical analysis step is generally the source of most ambiguity, and most parser combinator libraries make it extremely inconvenient to parse something that isn't a character stream. (Hammer is actually worse than most here; it's not possible to use it to parse anything other than streams of bytes, though perhaps it can be excused because of the limitations of C)
I firmly disagree about the value of PEGs (and hence combinators) as a formalism that's ambiguous or hard to reason about. PEGs are as well-specified as CFGs, they just work in a different way. One of the strengths of PEGs is that they're never ambiguous. That does mean that order of alternates is important, and sometimes you do have to fiddle with that order when translating something like ABNF. But what you get in return is never having to deal with a parse forest, let alone punting on that problem by just reifying the greedy parse as the correct one.
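A small sketch of the kind of reordering I mean, with made-up rule names: in ABNF the alternation "in" / "int" is unordered, but read as a PEG the first alternative wins, so on input "int" the rule would consume only the "in" prefix and never try the longer literal. The usual fix when translating is simply to put the longer alternative first:

    keyword ← 'int' / 'in'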
I've written a tool that converts a declarative specification of a PEG grammar into a parser, although I haven't released it yet. It works great, I've used it on a number of real-world structured data formats.
Edit: cut out the first paragraph after checking the user name. Hi TQ.
"PEGs are never ambiguous" is equivalent to solving a problem by proclaiming the wrong solution to be correct.
In that sense, LALR parser generators (e.g. yacc) are also never ambiguous, because even if your grammar is ambiguous in the formal sense, yacc will produce a working parser with well-defined resolution of conflicts...
But the reality is,
E = E `+` E
is intrinsically ambiguous, no matter how you want to spin it... the only difference is that LALR generators point that out, whereas PEG generators sweep it under the rug.
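To make that concrete, here is roughly what the check looks like with a tool (a minimal bison/yacc sketch; NUM is an assumed token, just for illustration):

    %token NUM
    %%
    E : E '+' E
      | NUM
      ;

bison flags this with a shift/reduce conflict warning, because after parsing E '+' E with another '+' in the lookahead it can either shift (grouping to the right) or reduce (grouping to the left). A PEG version of the same rule, once rewritten to avoid the left recursion, just quietly yields one of those readings.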
I don't agree, as I've said, because PEG is a declarative formalism, and LALR is a parsing strategy. GLL implemented with graph-structured stacks can and will produce a parse forest for an ambiguous grammar; LALR will manifest one of those parses for the same grammar.
To lean on that means you have hidden information: your grammar is BNF + LALR, or BNF + GLL, not just BNF. PEGs, by contrast, are always PEGs. What you see is what you get.
A grammar in PEG format will always give you one parse, and which parse is predictable. If you want a different parse, you have to rewrite it. Indeed, as I'm sure you know, your example grammar isn't valid in the original PEG formalism, which prohibits immediate left recursion. Automatic rewrites into an intermediate rule are the leading method of allowing it.
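To make that concrete, a sketch of the intermediate-rule rewrite for the arithmetic rule upthread (NUM standing in for whatever the operands actually are):

    E  ← NUM E'
    E' ← '+' NUM E' / ε

or, equivalently, the repetition form E ← NUM ('+' NUM)*. Either way, whether '+' ends up left- or right-associated is decided by the action code that folds the sequence, not by the rule itself.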
Ambiguity is a well-defined concept in grammars, and PEGs aren't.
I've noticed that a lot of CFG enthusiasts don't like this about PEGs. They consider it inelegant, unprincipled. Some of that is aesthetic, some is unfamiliarity, and some is sunk cost: none of it actually engages with the formal expressive power of PEGs, nor their ergonomics as a practical tool for development.
Sure, I guess you're right, if you consider formal grammars as existing in their own abstract bubble removed from our reality.
I, on the other hand, am primarily interested in using grammars to parse programming languages, which are essentially a human form of communication (computers could use bitcode or lisp-style ASTs, no need for syntax).
I'm not an expert on PEGs, so this is copy-pasted from the Wikipedia page on PEGs, I hope it's valid - the classical dangling-else ambiguity.
S ← 'if' C 'then' S 'else' S / 'if' C 'then' S
Now, there is an obvious ambiguity here, for a human reader (mediated, of course, by tabs). If you're using a formalism that wishes that ambiguity away, it just means that the formalism is a non-ideal one. What I like about LALR parser generators is that they will explicitly alert you to this kind of human-level ambiguity.
It's okay that you aren't familiar with PEGs. I am familiar with PEGs, and so I know just by reading that, which way the ambiguity resolves. So for me, a human reader, there is no obvious ambiguity.
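To spell out that reading for the rule quoted above (assuming the usual PEG semantics: ordered choice, greedy matching, local backtracking), take the input

    if C1 then if C2 then S1 else S2

The outer S tries its first alternative, 'if' C 'then' S 'else' S. The nested S greedily consumes "if C2 then S1 else S2", so no 'else' remains for the outer alternative and it fails; the outer S then falls back to its second alternative, 'if' C 'then' S, which succeeds. The else therefore binds to the nearest if, which is the conventional resolution.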
It's like saying there's an ambiguity in "not a and b". Sure, if you don't know the precedence assigned to 'not' and 'and' in your language. But you're supposed to learn those things.
Your not knowing how PEGs work is a weak argument against using them.
In my experience LPeg does a good job of being comprehensible for most reasonably-sized use cases. It can even parse the grammar of Lua itself. I've personally had not that much trouble translating BNF grammars to PEG, though the result is typically longer. It still satisfies the goal of separating parsing logic from data logic, which is a big step.
For more info on the issues with PEGs -- and a paper showing how to correctly translate general LL(1p) grammars to PEG -- see:
LPeg also has what it calls match-time captures (http://www.inf.puc-rio.br/~roberto/lpeg/#matchtime), which can be used to parse non-context free grammars like common TLV (tag, length, value) formats. For example, I've written a pure LPeg parser for parsing PKIX objects like X.509 certificates. Example edited code snippets with some high-level and low-level bits:
-- returns DER object pattern that captures inner value
local function Cobject(identifier, patt)
    local match
    if lpeg.type(patt) then
        match = function (s)
            return lpeg.match(patt * -P(1), s)
        end
    elseif type(patt) == "function" then
        match = patt
    elseif patt == nil then
        match = function (s)
            return s
        end
    else
        error(sformat("expected function, pattern or nil, got %s", type(patt)), 2)
    end

    return Cmt(identifier, function (s, pos)
        local n, pos = assert(unpacklength(s, pos))
        local s1 = s:sub(pos, pos + n - 1)
        pos = pos + n

        return (function (pos, v, ...)
            if v then
                return pos, v, ...
            else
                return false
            end
        end)(pos, match(s1))
    end)
end

local BIT_STRING = Cobject(P"\x03", function (s)
    local pad = s:byte(1) -- first octet is number of padding bits
    assert(pad == 0, "BIT STRING not octet aligned") -- we only support DER
    return s:sub(2)
end)

local IA5String = Cobject(P"\x16")

local OID = function (oid)
    if oid then
        local s = packoid(pkix.txt2oid(oid))
        return P(sformat("\x06%s%s", packlength(#s), s)) * Cc(oid)
    else
        return Cobject(P"\x06", function (s)
            return assert(unpackoid(s))
        end)
    end
end

local SEQUENCE = function (patt)
    return Cobject(P"\x30", patt)
end

local TBSCertificate = SEQUENCE(Ct(
    Cg(Version, "version") *
    Cg(CertificateSerialNumber, "serialNumber") *
    Cg(AlgorithmIdentifier, "signature") *
    Cg(Name, "issuer") *
    Cg(Validity, "validity") *
    Cg(Name, "subject") *
    Cg(SubjectPublicKeyInfo, "subjectPublicKeyInfo") *
    Cg(UniqueIdentifier(1), "issuerUniqueID")^-1 *
    Cg(UniqueIdentifier(2), "subjectUniqueID")^-1 *
    Cg(Extensions, "extensions")^-1 *
    Cg(P(1)^1, "trash")^-1
))

local Signature = BIT_STRING

local Certificate = SEQUENCE(Ct(
    Cg(TBSCertificate, "tbsCertificate") *
    Cg(AlgorithmIdentifier, "signatureAlgorithm") *
    Cg(Signature, "signature")
))
"where the semantics of a particular particle require running the parser, or worse"
Yup.
Consider the much maligned (apache) httpd.conf and makefile. Their code was the source of truth for correctness. No matter how terrible their syntax, they would have been sufferable if they had grammars.
Bespoke parsing of expressions benefits from having an implicit grammar, like this OC. But the threshold for needing explicit grammars is pretty low.
Prime examples are all the data exchange formats. Descriptions of JSON and CSV are short and naively reasonable, but insufficient. As we've seen, any ambiguity that can happen will happen.
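A tiny illustration on the CSV side (Python; quoting is where a naive reading of the format and the actual field boundaries part ways):

    import csv, io

    line = 'a,"b,c",d'
    print(line.split(','))                      # ['a', '"b', 'c"', 'd']  - 4 "fields"
    print(next(csv.reader(io.StringIO(line))))  # ['a', 'b,c', 'd']       - 3 fields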
> Why simpler is better and why you don't need a parser generator.
As far as I can see, this isn't fully answered, unless the claim is strictly limited to the question of need.
In my case, I certainly want a parser generator. I'm working on a language, and I did a very early version using a hand-rolled recursive descent parser.
Then I realized I wanted a syntax that was human friendly, so I graduated to megaparsec. That worked, it's an elegant way to describe a parser, but as a language gets complex you have to work in implicit orderings (magic, frankly) to handle the limits of recursive descent.
Finally, I realized I needed a spec and an implementation in sync with that spec, so I went with BNFC[2], and it even generates some pretty documentation[1] for me.
This is the classic debate over using a domain specific language vs. a general purpose language, so there's no hard answer one way or the other. For me, the deciding factor was I didn't want to write, test, etc. my own parser, and I think there being a canonical way for others to parse my language (in other languages as well) is helpful.
Right below the line you quote, the author makes three arguments:
> • It’s highly instructive, in a way that using a parser generator is not. To quote Feynman: “What I cannot create, I do not understand”
> • It’s an important skill: most real-world compilers use hand-written parsers because they provide more control over error handling, significant whitespace, etc.
> • It’s not actually difficult!
The exercises themselves are meant to justify these arguments.
I think the exercises conclusively demonstrate that #1 and #3 are true. If you can easily follow these exercises, and if, by the end, you feel like you've learned something useful, then it follows that hand-rolling a parser is instructive and not too difficult.
I strongly agree with the author that if you're developing a language for the first time, you shouldn't use a parser generator at first. You should hand-roll a parser first, and then, when you see the limitations of your hand-rolled parser, adopt a parser generator, now with full understanding of what the generator is doing for you (and not doing for you).
As for #2, the article demonstrates how to do error handling in a manner that would be a head-scratcher in Yacc.
You claim here that it was easier to develop a "human friendly" syntax with megaparsec than it was to do that in a hand-rolled recursive-descent parser. That could be true, but that's not my experience. To be convincing, you'd need to do what the author did, (even though you say the author didn't): you'd need to justify your argument with specific examples.
> > • It’s highly instructive, in a way that using a parser generator is not.
Top-down parsers are nice, but even better would be not to have to write them each time you need to parse something. Also, top-down parsers don't handle (well) some useful grammar constructs, and it's tedious to remember to always formulate grammars in a certain way. Next, grammars are for many people more convenient to work with than actual parser code. So parser generators have their place.
> To quote Feynman: “What I cannot create, I do not understand”
And this can be true for parser generators. After you've figured out how to do that, you could be tempted to never use external tools, and have a nice jump from, say, an arbitrary CFG to a generalized LR parser, while controlling all the parts in between.
I know people often read headlines and then instantly comment, but I really did read most of the post.
What I realized after writing most of my comment was that the author was primarily interested in explaining a recursive descent parser, and the arguments you quoted were simply a hook to get people interested.
That's why I phrased my objection to not _fully_ answering the question.
And it's a good question, so I thought the other side deserved some exploration.
> You should hand-roll a parser first, and then, when you see the limitations of your hand-rolled parser, adopt a parser generator, now with full understanding of what the generator is doing for you (and not doing for you).
Writing your own recursive descent parser only teaches you recursive descent. So, sure, you'll have a clear idea of what a parsec derivative is doing since that's also RD, but it won't help you understand what a LALR parser generated by yacc is doing.
The other problem is for someone to use your parser in another language, they have to port the whole thing and maintain that port as your language changes. Talking about "the first time you write it" and pedagogical uses is entirely fair, but it's only the beginning of the story.
> To be convincing, you'd need to do what the author did, (even though you say the author didn't): you'd need to justify your argument with specific examples.
Nope, never said the author didn't provide examples. To be clear, it's an excellent tutorial on how to write a RD parser.
I'm not prepared to rewrite a bunch of code in two styles, but you're welcome to take a look at the expression parser[1].
In this case, I didn't want to write a whole precedence scanner for expressions, and I wanted precedence to be clear to a reader.
So there's some nuance I didn't capture: in Haskell parsing, writing your own RD parser starts to look like parsec because it's such a natural expression of the problem.
If I was going to write my own parsec, I could still probably use an existing combinator[3] because the parsec model is so generic. That's why I wanted to use that to do a more human readable, and thus more complex, syntax.
But, as I mentioned above, recursive descent kinda sucks. As an example, take the parsing for the left-hand side of an assignment[2]. Sometimes I have 'try' calls, other times I don't. Sometimes the ordering around alternatives (the <|> operator) matters, sometimes it doesn't.
I know in abstract why it works one way or another. But, honestly, most of that is in there because it got tests to pass. My interest is in writing a language, not a parser.
And that's really why I switched from megaparsec to a parser generator. Once I got my grammar to be (reasonably) unambiguous, my source was just the plain, trivially readable BNF rules, and I nuked those stupid tests.
Once, I wanted a parser, and being the sort of person who thinks there is value in seeing what other people have done before, I took a look at parser generators.
The learning curve seemed long and steep, much more than seemed necessary for my modest requirements, so I rolled my own.
I got a parser out of the exercise, but also an appreciation for why the tools exist, and a new-found motivation for learning how to use them.
How do you handle good error messages or error recovery? I've played around with parser generators but I never figured out how to do either in a satisfactory fashion. Granted, my hand written parser doesn't do error recovery very well either.
My working approach is if you're writing a compiler, it wants to have an unambiguous grammar and shouldn't even attempt recovery. If input doesn't parse, the compiler can recommend (or just run) the linter. That keeps the common case fast and simple.
The linter/fixer can have a relaxed syntax specifically designed to handle messy code and suggest corrections. That comes from a philosophy of treating error correction and user assistance as a separate task.
But that's an approach I'm taking because I want to get a reference implementation together as quickly as possible. It's definitely not how modern IDEs work.
JetBrains GrammarKit uses a PEG[1] because they need strong support for error recovery[2] using hints. Another interesting library is Tree-sitter[3]; it does incremental parsing keeping an AST constantly up to date for you.
Relevant to this whole discussion, GvR wrote a series on PEG parsers[4] in which he starts out by writing one by hand and then shows how to write one that accepts a grammar.
Alas, most parser generators don't have very good error recovery (and some have such terrible error recovery that I think it's worse than not having any!).
It turns out that this isn't inevitable: there's been a long strand of research on decent error recovery for LR parsers, at least, but it needed a bit of a refresh to be practical. If you'll forgive the blatant self promotion, we tackled this problem in https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/ which is implemented in our Rust parsing system https://github.com/softdevteam/grmtools/. It won't beat the very best hand-written error recovery routines, but it's often not far behind.
There's been a ton of code that I've written that in retrospect, I should never have written.
But I've never regretted when I wrote a parser. Maybe because it takes such a large activation energy to get over the hump and actually do it, that I only do it when absolutely necessary. But it always seems easier and more useful than I thought before doing it.
Now that this post has prompted that realization, I wonder if it will stay true...
I have _absolutely_ regretted writing a parser by hand. Once I replaced it with an ANTLR grammar and a comparatively-trivial bit of glue, my thrift parser became not only easier to refactor but more reliable.
My hand-written recursive descent parser was a perennial source of bugs, where ANTLR has yielded almost none. Some fiddly things I was doing with comments became much easier, if not effortless.
I highly, HIGHLY recommend at least starting with a compiler-generator like ANTLR. The "activation cost" of your project will be much lower, and you may find that you never actually _need_ the level of control you give up.
One reason to use a parser generator for a new language is that it will tell you that your language is inherently ambiguous or otherwise broken in a way you didn’t realize. A hand-coded parser tends to just codify your bad assumptions.
It can actually be a good idea to maintain a YACC (or other) grammar for your language just to run the tool as a kind of “grammar linter”, even if you’re going to write the parser by hand.
If you construct your parser in some kind of declarative fashion (e.g., a parser combinator DSL), you can do that kind of error checking directly on the same spec that you compile to produce your parser.
Most parser combinator DSLs use PEGs to parse, which don't actually prevent ambiguity, but instead hide it. I strongly recommend using something that compiles your declarative grammar into LL(k) or LALR(k).
I do sometimes wonder if a lot of these arguments themselves would be less ambiguous if PEGs (or some other name of them) were formally added to Chomsky's Hierarchy of Languages between Regular Languages and the languages expressed by CFGs. Though I think it would take someone with a much more rigorous math background than myself to make the case formal enough to get it accepted.
(If it helps, and it may be a red herring, the dualism between PEGs and Parser Combinators has me thinking it's a Category Theory related "step" in the hierarchy. "Deterministic Monadic Compositions" in a reflection of DFA/NFA duals to Regular Languages might imply "Monadic Languages" as a possible name? Again, my math background is definitely not formal enough here to make actual suggestions, but maybe it sparks an idea for someone else.)
You're not aware of them... because they never exist. A language specified by a recursive descent parser is never ambiguous in the first place. There's nothing to resolve.
I don't know where 'codify your bad assumptions' comes into it? What assumptions? How are they codified without you knowing?
The language your parser parses may not be the language you had in mind. Indeed, the language you had in mind may not actually exist. It’s easy to operate based only on examples, and think you know what your language is, when in fact there are cases you didn’t consider. Your parser will resolve the ambiguity (because it has to), but if the tool had told you about the ambiguity, you might have redesigned the language. The result is often that one of your users will discover the ambiguity instead, and then it may be too late.
As an example too well-known to actually occur, suppose you parse “if x then if y then x else z” with the else clause associated to the first if statement, because that’s just how you happened to write the parser. It’s not ambiguous, but your users won’t be happy when they find out the unambiguous rule.
> The result is often that one of your users will discover the ambiguity instead, and then it may be too late.
But there are no ambiguities in recursive descent! They won't discover them because... they don't exist! It's literally impossible.
> As an example too well-known to actually occur, suppose you parse...
Great example - and the good thing about recursive descent is that it's impossible to write this ambiguously - you must prefer one or the other.
> It’s not ambiguous
Correct. So why are you telling me the problems of ambiguity?
> but your users won’t be happy when they find out the unambiguous rule
Why won't they be happy? You can tell them exactly how their code is going to be parsed. I thought ambiguity was bad, but now you're complaining about unambiguity as well?!
We seem to be talking past each other. :) Of course there are no ambiguities in any parser (unless it returns multiple parses, which would hardly be practical). But that’s almost always because the designer had to resolve some ambiguities present in the grammar.
The question is, did the designer resolve them deliberately or accidentally? And were they resolved in an ergonomic way?
The grammar fragment “expr :: IF expr THEN expr ELSE expr” is ambiguous. If I write a statement like the example, I expect (from 60 years of language tradition and simple ergonomics) a language author to choose to resolve that ambiguity by associating an ELSE with the nearest IF. Telling your puzzled users “it’s not ambiguous, it always goes with the farthest IF!” isn’t going to make them happier. It also won’t work to say “well, + has higher precedence than /, but that’s OK, it always does that”. Those are just bugs in the language design caused by a bad resolution of ambiguity.
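(Concretely: with the conventional precedence, 1/2+3 means (1/2)+3 = 3.5, while the "it always does that" resolution gives 1/(2+3) = 0.2; both are unambiguous, but only one matches what users expect.)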
If you use a tool to flag the ambiguities in your grammar, then you can be sure all the resolutions in your parser are deliberate and not accidental.
> But that’s almost always because the designer had to resolve some ambiguities present in the grammar.
Not if they started with a grammar that didn't have any ambiguity in the first place.
I think you're coming from the angle that you always start with a CFG, resolve ambiguity, then write a parser.
Imagine that I never write a CFG for my language. No CFG exists! Instead - I start by writing a PEG, and then I write a recursive descent parser from that. At no point in this process have I had to resolve ambiguity. I didn't start with an ambiguous CFG and then write a PEG from it. I started with a PEG. It's never been ambiguous, and never will be ambiguous. There's no ambiguity.
> The grammar fragment “expr :: IF expr THEN expr ELSE expr” is ambiguous.
Right.... but I wouldn't write that because I'm not starting with a CFG I'm starting with a PEG.
> Telling your puzzled users “it’s not ambiguous, it always goes with the farthest IF!” isn’t going to make them happier.
I can't understand this - if I give them a simple well-defined rule that tells them what the code means they'll be happy. What do you think they'd want instead? No rule? A badly defined rule?
> Enter an expression on the left and see the parse tree change!
I don't get it. The left input field is uneditable in Firefox, Edge and Chrome. Navigation is broken across the board. Even copying and pasting the quote above somehow included most of the page even though I only selected that one sentence.
This looks very cool and I would love to try it out! :(
Yeah I think something is broken because at least in chrome when you try to type in the left box the following exception keeps being thrown in the console
VM463:82 Uncaught TypeError: Cannot read property 'startContainer' of undefined
at HTMLPreElement.eval (eval at run (eval at runScriptElement (octopus-2.js:525)), <anonymous>:82:40)
Author here - happy to answer questions, as always.
Thanks also for feedback on the format of the article. It's a bit of an experiment on how to present dense information effectively. Some more rationale here: https://tiarkrompf.github.io/notes/?/octopus-notes/
Octopus notes is an interesting idea. Thanks for the link on it.
And I do like to write a lot of little parsers. I worked with someone once who would somehow transform every business problem into “we need to write a compiler”. Sounds crazy but he achieved brilliant results and has gone on to great things.
Thanks for the other replies, but I was asking the original author for a suggestion using the presented framework, rather than an alternate algorithm or approach from someone else. As presented, the approach didn't seem to handle unary operators.
I probably should have noted that I am already familiar with Pratt parsing, which seems like something that isn't actually brain-dead simple and obvious in the same way (which is why it was worth writing a paper about in the 1970s.)
Hoping for a reply from the original author to recommend a simple approach to add unary operators.
Great question! Unary operators are really simple to add: you just look for the operator symbol first thing at the right level of precedence. Same idea as "if (peek == '(') ..." for parentheses, but outside the code that deals with '*' and '^' (if you want those to bind more strongly).
When parse_node is an operator you know it is the "-2" in "-2^-2(2+2)", and when parse_peek is the operator it is the "-(2+2)" in the same expression. An example of this:
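Here is a rough, self-contained sketch of that idea as generic precedence climbing in Python - not the article's parse_node/parse_peek code, and the precedence chosen for unary minus (between '*' and '^') is just one reasonable option:

    import re

    def tokenize(src):
        # numbers and single-character operators/parentheses; whitespace is skipped
        return re.findall(r"\d+|[-+*/^()]", src)

    def parse(tokens, min_prec=0):
        node = parse_prefix(tokens)
        prec = {'+': (1, False), '-': (1, False),
                '*': (2, False), '/': (2, False),
                '^': (4, True)}          # (binding power, right-associative?)
        while tokens and tokens[0] in prec:
            op = tokens[0]
            p, right_assoc = prec[op]
            if p < min_prec:
                break
            tokens.pop(0)
            node = (op, node, parse(tokens, p if right_assoc else p + 1))
        return node

    def parse_prefix(tokens):
        # the unary operator: looked for first thing, before the atom,
        # at whatever precedence level you want it to bind at
        if tokens and tokens[0] == '-':
            tokens.pop(0)
            return ('neg', parse(tokens, 3))   # tighter than '*', looser than '^'
        return parse_atom(tokens)

    def parse_atom(tokens):
        tok = tokens.pop(0)
        if tok == '(':
            node = parse(tokens)
            assert tokens.pop(0) == ')', "expected ')'"
            return node
        return int(tok)

    print(parse(tokenize("-2^2 + 3*4")))
    # ('+', ('neg', ('^', 2, 2)), ('*', 3, 4))  i.e. -(2^2) + (3*4)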
If you're writing a compiler, sure, a hand-written parser is probably better than a generated parser.
However, I'd also make the related claim that we should be much more ready to write formal parsers for other use cases. Far too often developers implement a "pile of regular expressions" where a parser (even a generated parser) would be a better choice.
First, there is an impressive overlap with what I like and what I have been doing. The notes on the piano, the parser, on GraphViz and on React all hit very close if not exactly things I've thought about or programmed.
Second, the means of presenting things is awesome too. I get lost very easily so there are probably ways to improve but this is very clever and interesting. The creativity and pedagogy here is incredible at all levels.
I am very impressed.
I need to keep an eye on this. I wish there was an RSS feed. I'll contact him.
The only annoying thing is that the scroll bar is in the middle of my screen. I wish they'd max-widthed the text while leaving the scrollable element full-width
I like the bit about CPS, but I wish it showed how to do the last part without having the async machinery in the host language.
What I have a hard time getting is how to implement fully delimited continuations. All the examples I know reuse the machinery of the host, so it's hard to translate (in my case, to Rust).
Considering that something as simple and limited as JSON has been a source of security vulnerabilities from ambiguities in the spec, thanks but nope. Declarative definitions of a grammar help identify those ambiguities... and if the grammar is defined separately from its interpretation, it opens the door to invisible ambiguities that can become another vector for vulnerabilities.
Unless you have provable implementations (great if you do!), declarative grammars meaningfully limit points of failure.
Spec ambiguities of JSON mostly come from the insufficient description of the data model and well-formedness, and not from the syntax itself. Say, to this day (including RFC 8259) duplicate keys in a JSON object are not explicitly forbidden, even though most applications require that. Probably the only issue arising from the JSON syntax proper would be the treatment of line and paragraph separators.
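A quick illustration of the duplicate-key point (Python's json module, which happens to keep the last binding; other parsers keep the first, reject the document, or surface both):

    import json

    # RFC 8259 only says object names "SHOULD" be unique, so behavior varies by parser.
    print(json.loads('{"a": 1, "a": 2}'))   # {'a': 2}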
I'd highly recommend looking at a few toy parser projects, like this and Crafting Interpreters (which are both recursive descent, with very different styles), writing however much of a typical parser you think makes sense "for fun," and then saving that somewhere you can get at it.
Writing a parser from scratch is kind of an obnoxious hurdle, but if you have some code you're already familiar with that you can copy-paste in and modify to your purposes, that's a lot more accessible.
(Also hats off to this author for making a parser so damn terse.)
A man can dig a hole. While digging one hole, he finds two holes. He digs the one hole till he finds the bottom, climbs up to where he found the two, then proceeds down the other hole, till he finds two more holes, and then digs to the bottom of the first hole he finds, until he hits bottom there. He descends until he hits bottom or digs through the earth. One hole at a time.
The digging is just the way your language constructs itself into different elements, spread out like directories and sub-directories.
I don't want to be that guy but I originally thought the CSS was broken. Harsh black on harsh white, everything squeezed in to the left third of my screen, and an unused scrollbar on the bottom.
That being said, I love the information. I don't care for the workflow that comes with the most popular parser generators and I inevitably want more control over my parser as I start adding error handling, etc. It's easier to just write it by hand in the first place
I'm not quite sure why they went with the borked "half the screen is empty" look. Ok, it's easier to read, but that negative space is bad design, regardless.
I found the design to be very intuitive even though I'd never encountered that type of information flow before. The curly braces didn't seem to serve any purpose as far as I could tell, but the negative space was great. I knew exactly where the content was going to pop up before I clicked the "link". No jarring resize, just plain ole empty space not taking your attention away from the main content until you want to see content in there.
I'd recommend reading the "Elkhound" paper [1], which introduces a GLR parser that's really easy to implement and very powerful; they use it in the paper to parse a large subset of C++. GLR parsers are so much more powerful and intuitive than recursive descent parsers (I find), and they often make grammars easier to read and understand.
That said, if you're building a real programming language it makes sense to build the parser by hand as it often runs faster. For example, my GLR Python parsers clocked in at around 100,000-300,000 lines of Python code per second, while the regular Python parser could do 5-10 times as much per second (including tokenizing, parsing and AST generation). Looking at the amount of Python code that is parsed every day, this is a significant difference.
As a previous student in Professor Rompf's compiler class, I had the best learning experience while writing the parser for a Scala-like language. The parser logic has since influenced not only how I code, but also how I think.
There are two separate things here:
1. Having a formal grammar definition as the authoritative spec
2. Using the formal grammar definition to generate your parser
#1 is quite valuable on its own, even if you don't want to do #2. It can be analyzed for ambiguities and inconsistencies, and you avoid the tar-pit of "the original implementation is the spec".
Whenever someone mentions parsers or parser combinators, the only thing that comes to my mind is Ragel, which is really well suited to generating all sorts of parsers: http://www.colm.net/open-source/ragel/
How about no? Parsers are one of the main sources of bugs in software. You should write parsers manually only when you have a very good reason to. Go ask any of your system security friends.