BNF was here: What have we done about unnecessary notation diversity (2011) [pdf] (grammarware.net)
60 points by susam on March 19, 2023 | 33 comments



So much of coding requires sitting inside the mind of some mid-20th century dude with a very different model of the world than we have.

Some of you probably work within SAP, and it's the same kind of problem - you have to audition for the part of some random German business logic nerd until you maybe get it. And it's still effortful every time.

It's worth mentioning that standing on the shoulders of giants is strange when so many of those giants are still alive and working on their monoliths. The whole experience is as much a cultural and historical and borderline religious pilgrimage as it is a syntactical exercise.

A bit like getting a vague answer to an Old Testament God prayer, I understand C++, Python, etc. on a deeper level not by programming, but watching the creators talk about their languages. If you hate C++, go watch Bjarne get interviewed. The language will look like his handwriting the next time you open your IDE. It's so weird.

If you struggle with Linux, go watch YouTube videos where Linus Torvalds talks for a couple hours. I swear the next time you open a command line or work within a slipshod hideous UI that I guess just works, or something, you'll somehow feel this uncanny relief because you know Linus.


> If you struggle with Linux, go watch YouTube videos where Linus Torvalds talks for a couple hours. I swear the next time you open a command line or work within a slipshod hideous UI that I guess just works, or something, you'll somehow feel this uncanny relief because you know Linus.

That doesn't make sense: Linus Torvalds doesn't work on the command line or the UI. I could understand it if this were about writing kernel drivers or programs that call kernel APIs, but the command line and other UIs are abstracted far away from the kernel, no?


To be clear, I don't have a deep understanding of Linux. But let's keep going with your train of thought regarding abstraction.

When I know it's "January" for whatever reason, and not "First Month", without knowing HOW and WHY this is the convention, it sucks. But then you see what Janus looks like - which, in this case, makes it even better for some reason than "First Month." In the case of programming languages and systems, many of these mythological figures are still alive and we can ask them questions.

So, with Linus, I'm approaching it from the individual differences angle. Some people will call IT if "Smooth edges of screen fonts" is turned off. "Why is this like this?? I can't work like this!"

Whereas I don't think Linus would notice, and if he did I doubt he would care. There's something about understanding where a creator of any given thing falls on that continuum that I find really helpful in overcoming friction with the "why is bad thing that could be good, by my likely ignorant definition, so bad!!" when it comes to new languages or systems.

Tangential, but I want to go back in time and ask some ancient Danes why #70 and #90 in Danish simply had to be absolutely ridiculous, given any possible alternatives. It still wouldn't be my preference, but something about understanding their perspective would be helpful.


> When I know it's "January" for whatever reason, and not "First Month", without knowing HOW and WHY this is the convention, it sucks. But then you see what Janus looks like - which, in this case, makes it even better for some reason than "First Month."

That's interesting, especially because I can't really relate to that. Don't get me wrong, I love this kind of trivia, but not knowing something like that has never been a problem for me at all.


I had the same thought!

Is this perhaps related to learning a language as a child versus as an adult?

As a child in an English-speaking household, "January" is just one of a million other things to pick up, and you just pick it up. As an adult learner, I can imagine it's very difficult because it seems so random; it's yet another thing that doesn't fit into a system so you have to memorise it and try to internalize it.

As a native English speaker, I find it very difficult to remember weekday names and month names in other languages. Mnemonics help, and knowing the derivation of the word can make for a good mnemonic.

Edit: re-reading the GP comment, sounds like it's not so much about mnemonics, more that knowing the historical reasons for weird design decisions can make it easier to accept them. Like knowing why "October" isn't actually the eighth month.


I think it's easier for me to accept weird natural language things from the distant past.

But for relatively recent programming language things? Oh man.

std::cout << "why"; std::cout << "why"; std::cout << "why";


std::cout << std::endl;

// It just rolls off the tongue.


    using namespace std;
It's fine, I won't tell anyone.


> So much of coding requires sitting inside the mind of some mid-20th century dude with a very different model of the world than we have.

Not sure whether some people's view of the world should impact the way programming languages are constructed.

It's like saying you dislike math proofs because you have different views of the world from Gauss or Hilbert, or that a triangle might be a rectangle because Pythagoras had an old view of the world which doesn't fit today's fashion and politics.


> very different model of the world than we have

Who are we? There is no we. There is no us against them. That's the thing about diversity: my background and world view are different from yours.


People like me. You're illustrating my point.


Linux/Linus maybe isn't the best example here, because Linux was heavily inspired by MINIX and other Unix-like OSes.

But yeah, I can see that understanding the roots of a technology, and the creator's views of the technology and the problems it's solving, would certainly alter our perception of it and make it more enjoyable.


Reasoning about a lot of things requires getting into someone else's headspace. If that is problematic, you may want to work on your empathy skills; "getting into someone else's headspace" is a crucial life skill, IMO.


This emotive approach to computer science might be inefficient but meaningful for rare questions of taste and ideology (e.g. substring functions: lengths or indices?), but I don't see how it could possibly help learning hard details (e.g. what escape codes can show up in your strings?) or even universal theoretical concepts (e.g. how do you use these synchronization primitives correctly, and why?).

From a more positive angle, you shouldn't worry about understanding "a very different model of the world" because, being a mathematical model of the world it cannot be much different from yours.


> If you hate C++, go watch Bjarne get interviewed.

That's why you shouldn't watch interviews. If they're good speakers they'll trick you into adopting their views.


Or you'll gain a tiny bit of empathy and understanding of where they came from and what it was like walking in their shoes. Doesn't mean you have to agree with everything they say.


Don't use the old ISO 14977:1996 specification for BNF. For a rationale, see: https://dwheeler.com/essays/dont-use-iso-14977-ebnf.html

Quick summary:

1. It is unable to indicate International/Unicode characters, code points, or byte values. ISO/IEC 14977:1996 only supports ISO/IEC 646:1991 characters.

2. It is unable to indicate character ranges.

3. It requires a sea of commas, so using it produces hard-to-read grammars.

4. It does not build on widely-used regex notation.

5. It has a bizarre, difficult-to-understand, and easily-misunderstood “one or more” notation.

6. It is challenging to understand and many key terms are undefined.

I suggest considering the EBNF notation from the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). That's much closer to the regex notation used by many software developers today, and since the point of a BNF is to communicate, using a language similar to a language most developers already know is a big advantage.
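To make the "sea of commas" point concrete, here is the same (made-up) identifier rule written in both notations:

```ebnf
(* ISO/IEC 14977: comma for concatenation, braces for repetition *)
identifier = letter, { letter | digit } ;

/* W3C XML-spec EBNF: regex-like operators, no commas */
identifier ::= letter (letter | digit)*
```

The second form reads almost exactly like a regex, which is the point: most developers can parse it on sight.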


The diversity of notations stems from the fact that the current BNF notations are always lacking in something fundamental that each use case needs.

For me, the breaking point came in trying to describe binary formats. I eventually had to come up with my own: https://github.com/kstenerud/dogma

Will it solve the BNF problem? No, but it does help the binary folks.


> current BNF notations are always lacking in something fundamental that each use case needs.

There are two ways to resolve this. 1) In a way that is backwards-compatible with BNF, and 2) in a way that isn't.

If your way is backwards-compatible, it's going to be much more easily understood by anyone who already knows BNF. They don't have to go back and double-check whether some construct means concatenation or an alternative form.

It also means that if someone is working on a grammar that doesn't need one of your new fundamentals and is expressible in plain BNF, they're going to write something that anyone who already knows BNF can read without any extra research.

Also, if your additions are good enough, they provide a good draft of a new way to extend "official" BNF in the future. If you extend in an incompatible way, that's not even a possibility.


I've seen both approaches used. And in both cases, nothing comes of it outside of the specific project (for an example, see the BNF in the XML spec).

What you end up with is a bunch of small innovations in each project, but no incentive to consolidate these gains (because it's so niche to begin with, and there's no significant business advantage to be had).

And of course anyone who tries will get massive pushback, a la https://xkcd.com/927

So everyone just does their own thing and we pretend that it's all fine.


In case you missed it: this paper, as I understand it, is not proposing a new "one true BNF" to become the standard. It's about writing meta-specifications, formal definitions of how each grammar notation itself is written, which tools can then use to translate and compare between the different notations.

From the abstract: "instead of adding another syntactic notation and arguing about its excellence, we propose to retain the diversity and to cope with it by formally defining syntactic notations and using such definitions to import existing grammars to grammar engineering frameworks and to export (pretty-print) existing grammars to any desired syntactic notation."
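A toy sketch of that import/export idea (all names here are mine, not the paper's, and naive string replacement would of course break on quoted terminals containing brackets): a rule body in ISO-style notation is re-printed in regex-style notation by mapping the repetition and option brackets and dropping the commas.

```python
import re

def iso_to_regex_style(rule: str) -> str:
    """Pretty-print an ISO 14977-style rule body in regex-style EBNF:
    { x } becomes ( x )*, [ x ] becomes ( x )?, commas become spaces."""
    rule = rule.replace("{", "(").replace("}", ")*")
    rule = rule.replace("[", "(").replace("]", ")?")
    return re.sub(r"\s*,\s*", " ", rule)

print(iso_to_regex_style("letter , { letter | digit }"))
# → letter ( letter | digit )*
```

A real framework would parse the rule into a grammar model first and pretty-print from that, rather than rewriting strings, but the round-trip idea is the same.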


There is no standard. Should we have one? Who should design such a standard? Is some form of EBNF the best way to represent computer language grammars for both humans and machines?

Recently I read a paper "A Translational BNF Grammar Notation (TBNF)" By Paul B Mann, Parsetec.

"EBNF is powerful, however, it describes only the recognition phase, which is only 1/3 of the process of language translation. The second phase is the construction of an abstract-syntax tree and the third phase is the creation of an instruction code sequence. Some parser generators automate the construction of an AST, but none, that I know of, automate the output of instruction codes."

I'm not entirely sold on the idea, but it is interesting to think about what could be different in this space.
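The three phases the quote distinguishes can be made concrete with a hand-coded sketch (this is my illustration of the claim, not TBNF syntax) for the tiny grammar expr ::= NUM ('+' NUM)*:

```python
import re

def tokenize(src):                      # phase 1: recognition
    return re.findall(r"\d+|\+", src)

def parse(tokens):                      # phase 2: abstract-syntax tree
    ast = int(tokens[0])
    for i in range(1, len(tokens), 2):
        assert tokens[i] == "+"
        ast = ("+", ast, int(tokens[i + 1]))
    return ast

def emit(node):                         # phase 3: instruction code sequence
    if isinstance(node, int):
        return [("PUSH", node)]
    _, left, right = node
    return emit(left) + emit(right) + [("ADD",)]

print(emit(parse(tokenize("1+2+3"))))
# → [('PUSH', 1), ('PUSH', 2), ('ADD',), ('PUSH', 3), ('ADD',)]
```

EBNF only describes the first function; TBNF's pitch, as quoted, is to drive all three from one notation.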


There is a standard. At least two, actually. There is "ISO/IEC 14977:1996(E), Syntactic Metalanguage: Extended BNF" (which was mentioned in the article), and RFC 5234, "Augmented BNF for Syntax Specifications: ABNF" (which is used by almost all newer RFCs). Take your pick!


14977 is a disaster, do NOT use it. See: https://dwheeler.com/essays/dont-use-iso-14977-ebnf.html

IETF's RFC works, but it has its own problems. In particular, most people use regexes constantly, and that RFC is unnecessarily incompatible with the widely-used regex notation.

I suggest looking at the W3C Extensible Markup Language (XML) 1.0 (Fifth Edition) notation as a plausible starting point: https://www.w3.org/TR/xml/#sec-notation
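On the regex-compatibility point: ABNF puts repetition counts before the element, where regexes put quantifiers after it. A small example (rule names are mine; DIGIT and HEXDIG are ABNF core rules from RFC 5234):

```abnf
; RFC 5234 ABNF: repetition is a prefix on the element
int = 1*DIGIT          ; regex equivalent: [0-9]+
hex = "0x" 1*HEXDIG    ; roughly 0[xX][0-9A-Fa-f]+  (ABNF quoted strings are case-insensitive)
```

A reader fluent in regexes has to consciously flip the quantifier position every time, which is exactly the friction being complained about.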


I like how Coq describes its grammar (https://coq.inria.fr/refman/language/core/basic.html), though it's definitely not a standard. The Coq grammar and language is very complex with edge cases for niche features so it's good they have a detailed reference (also because other than CPDT and Software Foundations there aren't many references or tutorials out there...).

IntelliJ Grammar-Kit also uses sort of EBNF (https://plugins.jetbrains.com/plugin/6606-grammar-kit), the file extension is even `.bnf`.

Even though parser generators often differ quite a bit from EBNF, I still wish more people followed this standard, and that parser generators could emit EBNF grammar definitions. It's not just for parsing directly: it's a good starting point for creating parsers; a reference to validate against and regression-test; and it can be used to generate a (sometimes ambiguous) parser, which is good for prototypes and toy languages.


It seems to be a compact graphical representation.

Speaking of graphical representations, here's a less compact one, but more intuitive: https://sqlite.org/syntaxdiagrams.html


It's maybe a bit of a hot-take but notation diversity was quite necessary. It's easy to forget that BNF was largely built for textbooks and white boards and it was designed primarily from the perspective of language generation. While language generation and language parsing are mathematically duals of each other, BNF was always under-specified and under-qualified to use for describing parsers, especially for describing (easily or cheaply) computable parsers. A diaspora of notations was somewhat inevitable, and there was never really a "golden age" of BNF for describing parsing grammars. (Even classic tools like yacc/bison were always custom-extended variants of BNF, and BNF always had to be written with a specific parser model in mind; LL(1) grammars were never immediately compatible or comparable with LALR(1) grammars or LR(1) grammars, etc.)
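One concrete instance of that parser-model sensitivity (my example, in a generic EBNF): the natural left-recursive rule for expressions is fine for LR(1) tools but sends an LL(1)/recursive-descent parser into an infinite loop, so the "same" grammar must be written differently per model.

```ebnf
expr ::= expr '+' term | term    (* left-recursive: fine for LR(1), loops LL(1) *)
expr ::= term ('+' term)*        (* same language, rewritten for LL(1) *)
```

Both describe the same language, but BNF alone gives no hint which form a given tool requires.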

BNF will probably always be a useful communication tool, but it has always been deficient for describing grammar parsing and it certainly will never be useful for grammar comparison like this paper seems to be looking for.


Guy Steele has a great talk about the various dialects that exist: https://youtube.com/watch?v=7HKbjYqqPPQ


That's not just BNF in its usual sense though.


BNF needs to be understood in context. It's not sufficient or complete for all things, because it's reductionist toward the class of problems it's a good fit for.

It works as ABNF in IETF contexts. It works in ASN.1, and therefore it works for SNMP MIBs. It can define PDUs "on the wire" down to the bit level.

It can also, in some sense, notate the semantics, if you intuit above the syntax of the choices of one, any, or many as to WHY you are proffered the alternates and what they mean. The problem is that most of the logic about WHY is encoded in comments, or is out of scope. It only denotes the forms.

The intersection of BNF and UNIX man-page command-line arguments remains probably its most visible context: cat [ -v ]

I learned some variants of BNF/EBNF in 1979 to learn DEC-10 instruction set semantics. For Pascal, Wirth had drawn syntax diagrams which morally encoded the same information, but in a visual form. It irked me that there was no connection or uplift from (E)BNF into the lex part of lex/yacc on the compiler course. We dived into lexical analysis without discussing a restricted grammar for representing things whose syntax we'd all been taught. I guess it's what you know, times who you learn it from, times what relevance it has.

There is some Venn-diagram collision in my mind between BNF variants and regex. Perhaps the [ and { forms, with | for alternates, were inevitable.

I don't write DSLs, so I don't play in this space, but I use BNF almost unconsciously every day, especially if it's a "read that IETF draft" day.


> and having noted the reasons for ISO EBNF failure

They didn't mention what IMHO might be the most important one: ISO EBNF is different enough from traditional regex notation that it's uncomfortable to read for those familiar with the latter. (At least it's not ABNF, which I think is even worse.)

That said, it seems the majority of grammar notations I've seen use | for choice, ? for zero or one, * for zero or more, + for one or more, and ( ) for grouping, with implicit space-concatenation (higher precedence than choice, by analogy with multiplication and addition). This is essentially standard regex metacharacters, and context-free-grammars are a step above regular grammars, so it's not surprising that such a notation developed.
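A grammar in that de facto regex-flavored notation looks like this (a made-up arithmetic fragment):

```ebnf
digit  ::= [0-9]
number ::= digit+ ('.' digit+)?
expr   ::= number (('+' | '-') number)*
```

Implicit concatenation binds tighter than |, just as juxtaposition binds tighter than alternation in regexes, so no commas or extra parentheses are needed.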


Programmer has N problems: too many syntax specification conventions. Programmer says: I know, I’ll design a formal syntax for specifying syntaxes. Now programmer has N+1 problems.




