An alternate title: "Just Shotgun Parse Your Inputs".
This isn't bad advice across the board, but it is bad advice for many applications.
This is how you end up with context-sensitive formats, or ones where the semantics of a particular particle require running the parser, or worse. It exposes a rich surface area for exploitation with adversarial inputs.
That said, the advantages cited at the front of the article are all real: error recovery and decent error messages are unsolved problems with existing parser generators, although ANTLR is pretty good at it.
But, especially if you're developing the data format rather than just implementing it, I strongly suggest using something like Instaparse, which can generate a parser directly from a grammar. Once this is solidified, it's reasonable to hand-roll something, but even there, I suggest using a nice parser combinator library, like Hammer or nom.
This provides more discipline: as long as you're coloring inside the lines, you'll end up with a parser which actually implements your spec. When you start adding functions to provide context, recover from errors, and provide meaningful feedback in the form of error messages for unexpected inputs, you'll know what you're doing: a classic recursive-descent parser can't provide the same conceptual separation between the parsing logic and the logic to support ergonomics or one-pass compiling or whatever add-ons you're putting in there.
I agree with everything you say, except your advice to use a parser combinator library, because most implement PEGs. Hammer is, of course, an exception to this, as long as you call `h_compile` to tell it to use a different backend.
Why not use PEGs? In short, they don't actually remove ambiguity from your grammar, but rather hide it in ways that are difficult to reason about. You're still dependent on code as a definition of the language that you accept, and worse, that code is very fragile in the face of refactoring. Further, the lexical analysis step is generally the source of most ambiguity, and most parser combinator libraries make it extremely inconvenient to parse something that isn't a character stream. (Hammer is actually worse than most here; it's not possible to use it to parse anything other than streams of bytes, though perhaps it can be excused because of the limitations of C)
I firmly disagree about the value of PEGs (and hence combinators) as a formalism that's ambiguous or hard to reason about. PEGs are as well-specified as CFGs, they just work in a different way. One of the strengths of PEGs is that they're never ambiguous. That does mean that order of alternates is important, and sometimes you do have to fiddle with that order when translating something like ABNF. But what you get in return is never having to deal with a parse forest, let alone punting on that problem by just reifying the greedy parse as the correct one.
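A small sketch of the kind of reordering I mean, with made-up rule names: in ABNF the alternation "in" / "int" is unordered, but read as a PEG the first alternative wins, so on input "int" the rule would consume only the "in" prefix and never try the longer literal. The usual fix when translating is simply to put the longer alternative first:

    keyword ← 'int' / 'in'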
I've written a tool that converts a declarative specification of a PEG grammar into a parser, although I haven't released it yet. It works great, I've used it on a number of real-world structured data formats.
Edit: cut out the first paragraph after checking the user name. Hi TQ.
"PEGs are never ambiguous" is equivalent to solving a problem by proclaiming the wrong solution to be correct.
In that sense, LALR parser generators (e.g. yacc) are also never ambiguous, because even if your grammar is ambiguous in the formal sense, yacc will produce a working parser with well-defined resolution of conflicts...
But the reality is,
E = E `+` E
is intrinsically ambiguous, no matter how you want to spin it... the only difference is that LALR generators point that out, whereas PEG generators sweep it under the rug.
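To make that concrete, here is roughly what the check looks like with a tool (a minimal bison/yacc sketch; NUM is an assumed token, just for illustration):

    %token NUM
    %%
    E : E '+' E
      | NUM
      ;

bison flags this with a shift/reduce conflict warning, because after parsing E '+' E with another '+' in the lookahead it can either shift (grouping to the right) or reduce (grouping to the left). A PEG version of the same rule, once rewritten to avoid the left recursion, just quietly yields one of those readings.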
I don't agree, as I've said, because PEG is a declarative formalism, and LALR is a parsing strategy. GLL implemented with graph-structured stacks can and will produce a parse forest for an ambiguous grammar; LALR will manifest one of those parses for the same grammar.
To lean on that means you have hidden information: your grammar is BNF + LALR, or BNF + GLL, not just BNF. PEGs, by contrast, are always PEGs. What you see is what you get.
A grammar in PEG format will always give you one parse, and which parse is predictable. If you want a different parse, you have to rewrite it. Indeed, as I'm sure you know, your example grammar isn't valid in the original PEG formalism, which prohibits immediate left recursion. Automatic rewrites into an intermediate rule are the leading method of allowing it.
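To make that concrete, a sketch of the intermediate-rule rewrite for the arithmetic rule upthread (NUM standing in for whatever the operands actually are):

    E  ← NUM E'
    E' ← '+' NUM E' / ε

or, equivalently, the repetition form E ← NUM ('+' NUM)*. Either way, whether '+' ends up left- or right-associated is decided by the action code that folds the sequence, not by the rule itself.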
Ambiguity is a well-defined concept in grammars, and PEGs aren't.
I've noticed that a lot of CFG enthusiasts don't like this about PEGs. They consider it inelegant, unprincipled. Some of that is aesthetic, some is unfamiliarity, and some is sunk cost: none of it actually engages with the formal expressive power of PEGs, nor their ergonomics as a practical tool for development.
Sure, I guess you're right, if you consider formal grammars as existing in their own abstract bubble removed from our reality.
I, on the other hand, am primarily interested in using grammars to parse programming languages, which are essentially a human form of communication (computers could use bitcode or lisp-style ASTs, no need for syntax).
I'm not an expert on PEGs, so this is copy-pasted from the Wikipedia page on PEGs, I hope it's valid - the classical dangling-else ambiguity.
S ← 'if' C 'then' S 'else' S / 'if' C 'then' S
Now, there is an obvious ambiguity here, for a human reader (mediated, of course, by tabs). If you're using a formalism that wishes that ambiguity away, it just means that the formalism is a non-ideal one. What I like about LALR parser generators is that they will explicitly alert you to this kind of human-level ambiguity.
It's okay that you aren't familiar with PEGs. I am familiar with PEGs, and so I know just by reading that, which way the ambiguity resolves. So for me, a human reader, there is no obvious ambiguity.
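To spell out that reading for the rule quoted above (assuming the usual PEG semantics: ordered choice, greedy matching, local backtracking), take the input

    if C1 then if C2 then S1 else S2

The outer S tries its first alternative, 'if' C 'then' S 'else' S. The nested S greedily consumes "if C2 then S1 else S2", so no 'else' remains for the outer alternative and it fails; the outer S then falls back to its second alternative, 'if' C 'then' S, which succeeds. The else therefore binds to the nearest if, which is the conventional resolution.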
It's like saying there's an ambiguity in "not a and b". Sure, if you don't know the precedence assigned to 'not' and 'and' in your language. But you're supposed to learn those things.
Your not knowing how PEGs work is a weak argument against using them.
In my experience LPeg does a good job of being comprehensible for most reasonably-sized use cases. It can even parse the grammar of Lua itself. I've personally had not that much trouble translating BNF grammars to PEG, though the result is typically longer. It still satisfies the goal of separating parsing logic from data logic, which is a big step.
For more info on the issues with PEGs -- and a paper showing how to correctly translate general LL(1p) grammars to PEG -- see:
LPeg also has what it calls match-time captures (http://www.inf.puc-rio.br/~roberto/lpeg/#matchtime), which can be used to parse non-context free grammars like common TLV (tag, length, value) formats. For example, I've written a pure LPeg parser for parsing PKIX objects like X.509 certificates. Example edited code snippets with some high-level and low-level bits:
-- returns DER object pattern that captures inner value
local function Cobject(identifier, patt)
    local match
    if lpeg.type(patt) then
        match = function (s)
            return lpeg.match(patt * -P(1), s)
        end
    elseif type(patt) == "function" then
        match = patt
    elseif patt == nil then
        match = function (s)
            return s
        end
    else
        error(sformat("expected function, pattern or nil, got %s", type(patt)), 2)
    end

    return Cmt(identifier, function (s, pos)
        local n, pos = assert(unpacklength(s, pos))
        local s1 = s:sub(pos, pos + n - 1)
        pos = pos + n

        return (function (pos, v, ...)
            if v then
                return pos, v, ...
            else
                return false
            end
        end)(pos, match(s1))
    end)
end

local BIT_STRING = Cobject(P"\x03", function (s)
    local pad = s:byte(1) -- first octet is number of padding bits
    assert(pad == 0, "BIT STRING not octet aligned") -- we only support DER
    return s:sub(2)
end)

local IA5String = Cobject(P"\x16")

local OID = function (oid)
    if oid then
        local s = packoid(pkix.txt2oid(oid))
        return P(sformat("\x06%s%s", packlength(#s), s)) * Cc(oid)
    else
        return Cobject(P"\x06", function (s)
            return assert(unpackoid(s))
        end)
    end
end

local SEQUENCE = function (patt)
    return Cobject(P"\x30", patt)
end

local TBSCertificate = SEQUENCE(Ct(
    Cg(Version, "version") *
    Cg(CertificateSerialNumber, "serialNumber") *
    Cg(AlgorithmIdentifier, "signature") *
    Cg(Name, "issuer") *
    Cg(Validity, "validity") *
    Cg(Name, "subject") *
    Cg(SubjectPublicKeyInfo, "subjectPublicKeyInfo") *
    Cg(UniqueIdentifier(1), "issuerUniqueID")^-1 *
    Cg(UniqueIdentifier(2), "subjectUniqueID")^-1 *
    Cg(Extensions, "extensions")^-1 *
    Cg(P(1)^1, "trash")^-1
))

local Signature = BIT_STRING

local Certificate = SEQUENCE(Ct(
    Cg(TBSCertificate, "tbsCertificate") *
    Cg(AlgorithmIdentifier, "signatureAlgorithm") *
    Cg(Signature, "signature")
))
"where the semantics of a particular particle require running the parser, or worse"
Yup.
Consider the much maligned (apache) httpd.conf and makefile. Their code was the source of truth for correctness. No matter how terrible their syntax, they would have been sufferable if they had grammars.
Bespoke parsing of expressions benefits from having an implicit grammar, like this OC. But the threshold for needing explicit grammars is pretty low.
Prime examples are all the data exchange formats. Descriptions of JSON and CSV are short and naively reasonable, but insufficient. As we've seen, any ambiguity that can happen will happen.
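A tiny illustration on the CSV side (Python; quoting is where a naive reading of the format and the actual field boundaries part ways):

    import csv, io

    line = 'a,"b,c",d'
    print(line.split(','))                      # ['a', '"b', 'c"', 'd']  - 4 "fields"
    print(next(csv.reader(io.StringIO(line))))  # ['a', 'b,c', 'd']       - 3 fields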
> Why simpler is better and why you don't need a parser generator.
As far as I can see, this isn't fully answered, unless the claim is strictly limited to the question of need.
In my case, I certainly want a parser generator. I'm working on a language, and I did a very early version using a hand-rolled recursive descent parser.
Then I realized I wanted a syntax that was human friendly, so I graduated to megaparsec. That worked, it's an elegant way to describe a parser, but as a language gets complex you have to work in implicit orderings (magic, frankly) to handle the limits of recursive descent.
Finally, I realized I needed a spec and an implementation in sync with that spec, so I went with BNFC[2], and it even generates some pretty documentation[1] for me.
This is the classic debate over using a domain specific language vs. a general purpose language, so there's no hard answer one way or the other. For me, the deciding factor was I didn't want to write, test, etc. my own parser, and I think there being a canonical way for others to parse my language (in other languages as well) is helpful.
Right below the line you quote, the author makes three arguments:
> • It’s highly instructive, in a way that using a parser generator is not. To quote Feynman: “What I cannot create, I do not understand”
> • It’s an important skill: most real-world compilers use hand-written parsers because they provide more control over error handling, significant whitespace, etc.
> • It’s not actually difficult!
The exercises themselves are meant to justify these arguments.
I think the exercises conclusively demonstrate that #1 and #3 are true. If you can easily follow these exercises, and if, by the end, you feel like you've learned something useful, then it follows that hand-rolling a parser is instructive and not too difficult.
I strongly agree with the author that if you're developing a language for the first time, you shouldn't use a parser generator at first. You should hand-roll a parser first, and then, when you see the limitations of your hand-rolled parser, adopt a parser generator, now with full understanding of what the generator is doing for you (and not doing for you).
As for #2, the article demonstrates how to do error handling in a manner that would be a head-scratcher in Yacc.
You claim here that it was easier to develop a "human friendly" syntax with megaparsec than it was to do that in a hand-rolled recursive-descent parser. That could be true, but that's not my experience. To be convincing, you'd need to do what the author did, (even though you say the author didn't): you'd need to justify your argument with specific examples.
> > • It’s highly instructive, in a way that using a parser generator is not.
Top-down parsers are nice, but even better would be not to have to write them each time you need to parse something. Also, top-down parsers don't handle (well) some useful grammar constructs, and it's tedious to remember to always formulate grammars in a certain way. Next, grammars are for many people more convenient to work with than actual parser code. So parser generators have their place.
> To quote Feynman: “What I cannot create, I do not understand”
And this can be true for parser generators. After you've figured out how to do that, you could be tempted to never use external tools, and have a nice jump from, say, an arbitrary CFG to a generalized LR parser, while controlling all the parts in between.
I know people often read headlines and then instantly comment, but I really did read most of the post.
What I realized after writing most of my comment was that the author was primarily interested in explaining a recursive descent parser, and the arguments you quoted were simply a hook to get people interested.
That's why I phrased my objection to not _fully_ answering the question.
And it's a good question, so I thought the other side deserved some exploration.
> You should hand-roll a parser first, and then, when you see the limitations of your hand-rolled parser, adopt a parser generator, now with full understanding of what the generator is doing for you (and not doing for you).
Writing your own recursive descent parser only teaches you recursive descent. So, sure, you'll have a clear idea of what a parsec derivative is doing since that's also RD, but it won't help you understand what a LALR parser generated by yacc is doing.
The other problem is for someone to use your parser in another language, they have to port the whole thing and maintain that port as your language changes. Talking about "the first time you write it" and pedagogical uses is entirely fair, but it's only the beginning of the story.
> To be convincing, you'd need to do what the author did, (even though you say the author didn't): you'd need to justify your argument with specific examples.
Nope, never said the author didn't provide examples. To be clear, it's an excellent tutorial on how to write a RD parser.
I'm not prepared to rewrite a bunch of code in two styles, but you're welcome to take a look at the expression parser[1].
In this case, I didn't want to write a whole precedence scanner for expressions, and I wanted precedence to be clear to a reader.
So there's some nuance I didn't capture: in Haskell parsing, writing your own RD parser starts to look like parsec because it's such a natural expression of the problem.
If I was going to write my own parsec, I could still probably use an existing combinator[3] because the parsec model is so generic. That's why I wanted to use that to do a more human readable, and thus more complex, syntax.
But, as I mentioned above, recursive descent kinda sucks. As an example, take the parsing for the left-hand side of an assignment[2]. Sometimes I have 'try' calls, other times I don't. Sometimes the ordering around alternatives (the <|> operator) matters, sometimes it doesn't.
I know in abstract why it works one way or another. But, honestly, most of that is in there because it got tests to pass. My interest is in writing a language, not a parser.
And that's really why I switched from megaparsec to a parser generator. Once I got my grammar to be (reasonably) unambiguous, my source was just the plain, trivially readable BNF rules, and I nuked those stupid tests.
Once, I wanted a parser, and being the sort of person who thinks there is value in seeing what other people have done before, I took a look at parser generators.
The learning curve seemed long and steep, much more than seemed necessary for my modest requirements, so I rolled my own.
I got a parser out of the exercise, but also an appreciation for why the tools exist, and a new-found motivation for learning how to use them.
How do you handle good error messages or error recovery? I've played around with parser generators but I never figured out how to do either in a satisfactory fashion. Granted, my hand written parser doesn't do error recovery very well either.
My working approach is if you're writing a compiler, it wants to have an unambiguous grammar and shouldn't even attempt recovery. If input doesn't parse, the compiler can recommend (or just run) the linter. That keeps the common case fast and simple.
The linter/fixer can have a relaxed syntax specifically designed to handle messy code and suggest corrections. That comes from a philosophy of treating error correction and user assistance as a separate task.
But that's an approach I'm taking because I want to get a reference implementation together as quickly as possible. It's definitely not how modern IDEs work.
JetBrains GrammarKit uses a PEG[1] because they need strong support for error recovery[2] using hints. Another interesting library is Tree-sitter[3]; it does incremental parsing keeping an AST constantly up to date for you.
Relevant to this whole discussion, GvR wrote a series on PEG parsers[4] in which he starts out by writing one by hand and then shows how to write one that accepts a grammar.
Alas, most parser generators don't have very good error recovery (and some have such terrible error recovery that I think it's worse than not having any!).
It turns out that this isn't inevitable: there's been a long strand of research on decent error recovery for LR parsers, at least, but it needed a bit of a refresh to be practical. If you'll forgive the blatant self promotion, we tackled this problem in https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/ which is implemented in our Rust parsing system https://github.com/softdevteam/grmtools/. It won't beat the very best hand-written error recovery routines, but it's often not far behind.
There's been a ton of code that I've written that in retrospect, I should never have written.
But I've never regretted when I wrote a parser. Maybe because it takes such a large activation energy to get over the hump and actually do it, that I only do it when absolutely necessary. But it always seems easier and more useful than I thought before doing it.
Now that this post has prompted that realization, I wonder if it will stay true...
I have _absolutely_ regretted writing a parser by hand. Once I replaced it with an ANTLR grammar and a comparatively-trivial bit of glue, my thrift parser became not only easier to refactor but more reliable.
My hand-written recursive descent parser was a perennial source of bugs, where ANTLR has yielded almost none. Some fiddly things I was doing with comments became much easier, if not effortless.
I highly, HIGHLY recommend at least starting with a compiler-generator like ANTLR. The "activation cost" of your project will be much lower, and you may find that you never actually _need_ the level of control you give up.
One reason to use a parser generator for a new language is that it will tell you that your language is inherently ambiguous or otherwise broken in a way you didn’t realize. A hand-coded parser tends to just codify your bad assumptions.
It can actually be a good idea to maintain a YACC (or other) grammar for your language just to run the tool as a kind of “grammar linter”, even if you’re going to write the parser by hand.
If you construct your parser in some kind of declarative fashion (e.g., a parser combinator DSL), you can do that kind of error checking directly on the same spec that you compile to produce your parser.
Most parser combinator DSLs use PEGs to parse, which don't actually prevent ambiguity, but instead hide it. I strongly recommend using something that compiles your declarative grammar into LL(k) or LALR(k).
I do sometimes wonder if a lot of these arguments themselves would be less ambiguous if PEGs (or some other name of them) were formally added to Chomsky's Hierarchy of Languages between Regular Languages and the languages expressed by CFGs. Though I think it would take someone with a much more rigorous math background than myself to make the case formal enough to get it accepted.
(If it helps, and it may be a red herring, the dualism between PEGs and Parser Combinators has me thinking it's a Category Theory related "step" in the hierarchy. "Deterministic Monadic Compositions" in a reflection of DFA/NFA duals to Regular Languages might imply "Monadic Languages" as a possible name? Again, my math background is definitely not formal enough here to make actual suggestions, but maybe it sparks an idea for someone else.)
You're not aware of them... because they never exist. A language specified by a recursive descent parser is never ambiguous in the first place. There's nothing to resolve.
I don't know where 'codify your bad assumptions' comes into it? What assumptions? How are they codified without you knowing?
The language your parser parses may not be the language you had in mind. Indeed, the language you had in mind may not actually exist. It’s easy to operate based only on examples, and think you know what your language is, when in fact there are cases you didn’t consider. Your parser will resolve the ambiguity (because it has to), but if the tool had told you about the ambiguity, you might have redesigned the language. The result is often that one of your users will discover the ambiguity instead, and then it may be too late.
As an example too well-known to actually occur, suppose you parse “if x then if y then x else z” with the else clause associated to the first if statement, because that’s just how you happened to write the parser. It’s not ambiguous, but your users won’t be happy when they find out the unambiguous rule.
> The result is often that one of your users will discover the ambiguity instead, and then it may be too late.
But there are no ambiguities in recursive descent! They won't discover them because... they don't exist! It's literally impossible.
> As an example too well-known to actually occur, suppose you parse...
Great example - and the good thing about recursive descent is that it's impossible to write this ambiguously - you must prefer one or the other.
> It’s not ambiguous
Correct. So why are you telling me the problems of ambiguity?
> but your users won’t be happy when they find out the unambiguous rule
Why won't they be happy? You can tell them exactly how their code is going to be parsed. I thought ambiguity was bad, but now you're complaining about unambiguity as well?!
We seem to be talking past each other. :) Of course there are no ambiguities in any parser (unless it returns multiple parses, which would hardly be practical). But that’s almost always because the designer had to resolve some ambiguities present in the grammar.
The question is, did the designer resolve them deliberately or accidentally? And were they resolved in an ergonomic way?
The grammar fragment “expr :: IF expr THEN expr ELSE expr” is ambiguous. If I write a statement like the example, I expect (from 60 years of language tradition and simple ergonomics) a language author to choose to resolve that ambiguity by associating an ELSE with the nearest IF. Telling your puzzled users “it’s not ambiguous, it always goes with the farthest IF!” isn’t going to make them happier. It also won’t work to say “well, + has higher precedence than /, but that’s OK, it always does that”. Those are just bugs in the language design caused by a bad resolution of ambiguity.
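(Concretely: with the conventional precedence, 1/2+3 means (1/2)+3 = 3.5, while the "it always does that" resolution gives 1/(2+3) = 0.2; both are unambiguous, but only one matches what users expect.)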
If you use a tool to flag the ambiguities in your grammar, then you can be sure all the resolutions in your parser are deliberate and not accidental.
> But that’s almost always because the designer had to resolve some ambiguities present in the grammar.
Not if they started with a grammar that didn't have any ambiguity in the first place.
I think you're coming from the angle that you always start with a CFG, resolve ambiguity, then write a parser.
Imagine that I never write a CFG for my language. No CFG exists! Instead - I start by writing a PEG, and then I write a recursive descent parser from that. At no point in this process have I had to resolve ambiguity. I didn't start with an ambiguous CFG and then write a PEG from it. I started with a PEG. It's never been ambiguous, and never will be ambiguous. There's no ambiguity.
> The grammar fragment “expr :: IF expr THEN expr ELSE expr” is ambiguous.
Right.... but I wouldn't write that because I'm not starting with a CFG I'm starting with a PEG.
> Telling your puzzled users “it’s not ambiguous, it always goes with the farthest IF!” isn’t going to make them happier.
I can't understand this - if I give them a simple well-defined rule that tells them what the code means they'll be happy. What do you think they'd want instead? No rule? A badly defined rule?
> Enter an expression on the left and see the parse tree change!
I don't get it. The left input field is uneditable in Firefox, Edge and Chrome. Navigation is broken across the board. Even copying and pasting the quote above somehow included most of the page even though I only selected that one sentence.
This looks very cool and I would love to try it out! :(
Yeah I think something is broken because at least in chrome when you try to type in the left box the following exception keeps being thrown in the console
VM463:82 Uncaught TypeError: Cannot read property 'startContainer' of undefined
at HTMLPreElement.eval (eval at run (eval at runScriptElement (octopus-2.js:525)), <anonymous>:82:40)
Author here - happy to answer questions, as always.
Thanks also for feedback on the format of the article. It's a bit of an experiment on how to present dense information effectively. Some more rationale here: https://tiarkrompf.github.io/notes/?/octopus-notes/
Octopus notes is an interesting idea. Thanks for the link on it.
And I do like to write a lot of little parsers. I worked with someone once who would somehow transform every business problem into “we need to write a compiler”. Sounds crazy but he achieved brilliant results and has gone on to great things.
Thanks for the other replies, but I was asking the original author for a suggestion using the presented framework, rather than an alternate algorithm or approach from someone else. As presented, the approach didn't seem to handle unary operators.
I probably should have noted that I am already familiar with Pratt parsing, which seems like something that isn't actually brain-dead simple and obvious in the same way (which is why it was worth writing a paper about in the 1970s.)
Hoping for a reply from the original author to recommend a simple approach to add unary operators.
Great question! Unary operators are really simple to add: you just look for the operator symbol first thing at the right level of precedence. Same idea as "if (peek == '(') ..." for parentheses, but outside the code that deals with '*' and '^' (if you want those to bind more strongly).
When parse_node is an operator you know it is the "-2" in "-2^-2(2+2)", and when parse_peek is the operator it is the "-(2+2)" in the same expression. An example of this:
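Here is a rough, self-contained sketch of that idea as generic precedence climbing in Python - not the article's parse_node/parse_peek code, and the precedence chosen for unary minus (between '*' and '^') is just one reasonable option:

    import re

    def tokenize(src):
        # numbers and single-character operators/parentheses; whitespace is skipped
        return re.findall(r"\d+|[-+*/^()]", src)

    def parse(tokens, min_prec=0):
        node = parse_prefix(tokens)
        prec = {'+': (1, False), '-': (1, False),
                '*': (2, False), '/': (2, False),
                '^': (4, True)}          # (binding power, right-associative?)
        while tokens and tokens[0] in prec:
            op = tokens[0]
            p, right_assoc = prec[op]
            if p < min_prec:
                break
            tokens.pop(0)
            node = (op, node, parse(tokens, p if right_assoc else p + 1))
        return node

    def parse_prefix(tokens):
        # the unary operator: looked for first thing, before the atom,
        # at whatever precedence level you want it to bind at
        if tokens and tokens[0] == '-':
            tokens.pop(0)
            return ('neg', parse(tokens, 3))   # tighter than '*', looser than '^'
        return parse_atom(tokens)

    def parse_atom(tokens):
        tok = tokens.pop(0)
        if tok == '(':
            node = parse(tokens)
            assert tokens.pop(0) == ')', "expected ')'"
            return node
        return int(tok)

    print(parse(tokenize("-2^2 + 3*4")))
    # ('+', ('neg', ('^', 2, 2)), ('*', 3, 4))  i.e. -(2^2) + (3*4)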
If you're writing a compiler, sure, a hand-written parser is probably better than a generated parser.
However, I'd also make the related claim that we should be much more ready to write formal parsers for other use cases. Far too often developers implement a "pile of regular expressions" where a parser (even a generated parser) would be a better choice.
First, there is an impressive overlap with what I like and what I have been doing. The notes on the piano, the parser, on GraphViz and on React all hit very close if not exactly things I've thought about or programmed.
Second, the means of presenting things is awesome too. I get lost very easily so there are probably ways to improve but this is very clever and interesting. The creativity and pedagogy here is incredible at all levels.
I am very impressed.
I need to keep an eye on this. I wish there was an RSS feed. I'll contact him.
The only annoying thing is that the scroll bar is in the middle of my screen. I wish they'd max-widthed the text while leaving the scrollable element full-width
I like the bit about CPS, but I wish it showed how to do the last part without having the async machinery in the host language.
What I have a hard time getting is how to implement fully delimited continuations. All the examples I know reuse the machinery of the host, so it's hard to translate (in my case, to Rust).
Considering that something as simple and limited as JSON has been a source of security vulnerabilities from ambiguities in the spec, thanks but nope. Declarative definitions of a grammar help identify those ambiguities... and if the grammar is defined separately from its interpretation, it opens the door to invisible ambiguities that can become another vector for vulnerabilities.
Unless you have provable implementations (great if you do!), declarative grammars meaningfully limit points of failure.
Spec ambiguities of JSON mostly come from the insufficient description of the data model and well-formedness, and not from the syntax itself. Say, to this day (including RFC 8259) duplicate keys in a JSON object are not explicitly forbidden, even though most applications require that. Probably the only issue arising from the JSON syntax proper would be the treatment of line and paragraph separators.
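A quick illustration of the duplicate-key point (Python's json module, which happens to keep the last binding; other parsers keep the first, reject the document, or surface both):

    import json

    # RFC 8259 only says object names "SHOULD" be unique, so behavior varies by parser.
    print(json.loads('{"a": 1, "a": 2}'))   # {'a': 2}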
I'd highly recommend looking at a few toy parser projects, like this and Crafting Interpreters (which are both recursive descent, with very different styles), writing however much of a typical parser you think makes sense "for fun," and then saving that somewhere you can get at it.
Writing a parser from scratch is kind of an obnoxious hurdle, but if you have some code you're already familiar with that you can copy-paste in and modify to your purposes, that's a lot more accessible.
(Also hats off to this author for making a parser so damn terse.)
A man can dig a hole. While digging one hole, he finds two holes. He digs the one hole till he finds the bottom, climbs up to where he found the two, then proceeds down the other hole, till he finds two more holes, and then digs to the bottom of the first hole he finds, until he hits bottom there. He descends until he hits bottom or digs through the earth. One hole at a time.
The digging is just the way your language constructs itself into different elements, spread out like directories and sub-directories.
I don't want to be that guy but I originally thought the CSS was broken. Harsh black on harsh white, everything squeezed in to the left third of my screen, and an unused scrollbar on the bottom.
That being said, I love the information. I don't care for the workflow that comes with the most popular parser generators and I inevitably want more control over my parser as I start adding error handling, etc. It's easier to just write it by hand in the first place
I'm not quite sure why they went with the borked "half the screen is empty" look. Ok, it's easier to read, but that negative space is bad design, regardless.
I found the design to be very intuitive even though I'd never encountered that type of information flow before. The curly braces didn't seem to serve any purpose as far as I could tell, but the negative space was great. I knew exactly where the content was going to pop up before I clicked the "link". No jarring resize, just plain ole empty space not taking your attention away from the main content until you want to see content in there.
I'd recommend reading the "Elkhound" paper [1], which introduces a GLR parser that's really easy to implement and very powerful; they use it in the paper to parse a large subset of C++. GLR parsers are so much more powerful and intuitive than recursive descent parsers (I find), and they often make grammars easier to read and understand.
That said, if you're building a real programming language it makes sense to build the parser by hand as it often runs faster. For example, my GLR Python parsers clocked in at around 100,000-300,000 lines of Python code per second, while the regular Python parser could do 5-10 times as much per second (including tokenizing, parsing and AST generation). Looking at the amount of Python code that is parsed every day, this is a significant difference.
As a previous student in Professor Rompf's compiler class, I had the best learning experience while writing the parser for a Scala-like language. The parser logic has since influenced not only how I code, but also how I think.
There are two separate things here:
1. Having a formal grammar definition as the authoritative spec
2. Using the formal grammar definition to generate your parser
#1 is quite valuable on its own, even if you don't want to do #2. It can be analyzed for ambiguities and inconsistencies, and you avoid the tar-pit of "the original implementation is the spec".
Whenever someone mentions parsers or parser combinators, the only thing that comes to my mind is Ragel, which is really well suited to generating all sorts of parsers: http://www.colm.net/open-source/ragel/
How about no? Parsers are one of the main sources of bugs in software. You should write parsers manually only when you have a very good reason to. Go ask any of your system security friends.