
Pet peeve about compiler education that is not repeated by this article (yay, well done!): the way you have to wade through months of parsing bullshit to get to the really interesting stuff (optimization and static analysis).



"the way you have to wade through months of parsing bullshit to get to the really interesting stuff (optimization and static analysis)."

If you look back at the late '70s, you'll discover something interesting: almost everything was parsing of one form or another.

Really. Parsing of flowgraphs, etc. Dataflow was done not just by bitvector methods (which were fairly new), but by parsing of graphs using graph grammars.

Most optimization analyses are very closely related to parsing (and can often be solved by graph reachability). The complexity bounds of most of them are also driven by parsing complexity results.

For example, Andersen-style points-to analysis is solvable as a context-free-language problem. Its time bounds are driven by the time bounds on context-free-language parsing; in particular, all-pairs reachability on context-free languages, which is O(N^3).
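To make this concrete, here's a toy sketch of Andersen-style constraint solving (all names and the statement encoding are invented for illustration; a real implementation uses worklists and on-the-fly constraint graphs, but the cubic fixpoint shape is the same one you see in CFL-reachability):

```python
from collections import defaultdict

def andersen(addr_of, copies, loads, stores):
    """Naive cubic fixpoint for Andersen-style points-to analysis.
    addr_of: list of (p, x) for  p = &x
    copies:  list of (p, q) for  p = q
    loads:   list of (p, q) for  p = *q
    stores:  list of (p, q) for *p = q
    Returns a dict mapping each var to a frozenset of pointed-to vars."""
    pts = defaultdict(set)
    for p, x in addr_of:          # p = &x  seeds the points-to sets
        pts[p].add(x)
    changed = True
    while changed:                # iterate to a fixpoint
        changed = False
        edges = list(copies)      # p = q   means pts(q) subset-of pts(p)
        for p, q in loads:        # p = *q  means pts(x) subset-of pts(p)
            edges += [(p, x) for x in pts[q]]          # for x in pts(q)
        for p, q in stores:       # *p = q  means pts(q) subset-of pts(x)
            edges += [(x, q) for x in pts[p]]          # for x in pts(p)
        for dst, src in edges:    # propagate along all subset edges
            before = len(pts[dst])
            pts[dst] |= pts[src]
            changed |= len(pts[dst]) != before
    return {v: frozenset(s) for v, s in pts.items()}
```

For a program like `p = &a; s = &p; q = p; r = *s`, the fixpoint concludes that p, q and r all point to a.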

Etc. Parsing is the basis of compilers for a very good reason. The fact that we have built other kinds of methods on top of it for optimization and analysis does not change the fact that, at the core, a lot of them are based on equivalence to parsing of graphs and graph reachability.

Another way to look at it: If you ever want to be able to be really good at compiler optimization, don't skip parsing. Otherwise you will miss the connections that are really important to being able to design efficient new algorithms, instead of just learning old ones.


Parsing also comes up in a stupidly large number of other areas within programming, including: serialization formats, network protocols, DSLs, log analysis, time series interpretation, NLP, unstructured data mining, and probably several I'm not familiar with. Fundamentally understanding what's going on when you're doing recursive-descent, shift/reduce, or combinator-based parsing unlocks a big toolbox that you can use to solve a number of other problems.


I'm not saying it isn't useful - I'm saying it's not interesting (and I'm talking about the act of parsing, not the algorithms as applied elsewhere). Think about this: you start learning compilers, you spend months on parsing, and after that what do you have? Nothing - you "compiled" stuff into an AST. Yawn.


"I'm not saying it isn't useful - I'm saying it's not interesting "

Then you are doing it wrong :)

"(and I'm talking about the act of parsing, not the algorithms as applied elsewhere)"

I'm not sure what this is supposed to mean. It sounds like you are trying to separate parsing text into ASTs and parsing flowgraphs, so you can claim one is boring and the other is "algorithms as applied elsewhere". There is no such separation. We just happen to name variables a little differently, so it doesn't quite look like "figure out if this one parenthesis matches this other one". But it is the same algorithm. If you are trying to say parsing is only the thing that produces an AST, that is not the only thing that is parsing. Yes, some books colloquially separate phases like that, but it's silly.

"Think about this: you start learning compilers, you spend months on parsing, and after that what do you have? Nothing - you "compiled" stuff into an AST. "

You are definitely doing it wrong. The same grammar parsing you taught someone in order to build an AST, you can now extend: "we can apply a slightly different language grammar to the AST and generate machine code", or "apply a slightly different language grammar to the AST and parse out flowgraphs and statements".

Etc. Parsing isn't just "take a bunch of text, turn it into an AST". The fact that a ton of people think this is one reason there are so many crappy little compilers that never go anywhere :)


We're having two different conversations here. You are saying (correct me if I'm wrong) that the algorithms you use for parsing are broadly applicable to program analysis. You are not wrong. It's just orthogonal to my point, which is that program analysis is really interesting.

To make an analogy here (from Lockhart's Lament: https://www.maa.org/external_archive/devlin/LockhartsLament....), you are saying that color theory is massively important throughout one's career as an artist. I am saying that I just like painting things.

Sure, there's lots of cool shit in parsing theory no doubt, with tons of broad application. But the way that we are taught it ("here's how we get an AST from written text") turns away people who are into program analysis. As such, we shouldn't consider parsing a pre-requisite for program analysis, regardless of how interesting or applicable the theory of parsing is.


I think DannyBee's using a slightly broader definition of parsing than you're used to. There's the narrow definition of parsing ("Here's how we go from a sequence of bytes to an AST") that's the concrete material taught in compiler courses. But there's also a much broader paradigm where "parsing" is broadened to "Here's how we go from a listed set of patterns, possibly overlapping, to a set of transformations defined by those patterns." The source input need not be a linear set of bytes; you can trivially extend it to trees by taking an in-order traversal of the tree, or even operate directly on tree or graph data structures.

With the broader definition, a large number of what you consider the interesting parts of program analysis are actually parsing problems. Instruction selection, for example, involves tiling the AST (or more typically, the IR) with a minimal set of machine instructions. In many compilers, this is implemented with a machine-definition file that describes a grammar for transforming the IR into machine-language instructions. Peephole optimization involves recognizing certain patterns (eg. division by a constant power of 2) and then transforming them into more efficient constructs (eg. shift-right). This can also be described by a grammar on the IR.
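The power-of-2 peephole rule mentioned above is easy to sketch as exactly that kind of pattern-over-a-stream (instruction format and names invented for illustration; note a real compiler must also handle signed division, which cannot be replaced by a plain shift):

```python
def peephole(instrs):
    """Toy peephole pass over (op, dst, src, operand) tuples.
    Each rewrite rule is a pattern in a tiny grammar over the
    instruction stream, applied one instruction at a time."""
    out = []
    for op, dst, src, n in instrs:
        # unsigned div by a power of two -> shift right
        if op == "div" and isinstance(n, int) and n > 0 and n & (n - 1) == 0:
            out.append(("shr", dst, src, n.bit_length() - 1))
        # multiply by one -> plain move
        elif op == "mul" and n == 1:
            out.append(("mov", dst, src, None))
        else:
            out.append((op, dst, src, n))
    return out
```

Seen this way, the pass is "parse the pattern, emit the transformation", which is the broader sense of parsing being argued for.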

I'm not sure if this is explicitly taught in compiler courses - it wasn't really in mine, but it was alluded to - but it's a really powerful way of thinking of the passes in a compiler. It lets you unify concepts, so that instead of memorizing all the specific algorithms used, you can think of a pass as "this is a transformation from this specific input language to this specific output language".


As much as I enjoyed optimization and static analysis, parsing is probably the single most useful knowledge I got from my formal CS education; and probably the only part of a compiler that most CS students will ever need to know.


Yes, I think understanding that there's a fundamental difference between the operation of parsing a language and "naive call-and-for-loop programming" is very important. Most naive efforts to parse a language starting with no knowledge simply fail (and that can include, say, efforts to validate a file name in a somewhat complex file system, or to parse a packet in a network protocol).

In college, it took me a bit of time to understand what a formal language really was and what the difference between a parser and a simple string processor was. When I arrived in industry, I noticed this was something a certain proportion of otherwise experienced programmers simply did not understand.


The symbiotic relationship between compiler development and hardware architecture design has a defining role in CS problems, and while you don't need to understand that relationship to write CRUD apps for a living, I think ignorance of it robs those who really want to understand this stuff of truths both technical and profound.


Totally!

Compiler development = Finite State Automaton + Push Down Automaton + Context Free Grammar + Intermediate Representation + Static Analysis/Optimization + Stack Machine + ...

Hardware Architecture = Stack Machine/Turing Machine + electricity + creativity

Stack Machine (given two stacks, or random-access memory) == Turing Machine

Compiler development starts and ends with Turing Machines.

Hardware architecture ends with Turing Machines operating at the speed of electricity through metal.


There are alternatives to this mainstream style of computer architecture, such as dataflow or hybrid quantum-classical systems, which both require a very different kind of compiler.

https://en.wikipedia.org/wiki/Dataflow_architecture


A recursive-descent parser is probably the only part of a compiler which people will easily derive on their own, even without taking a compiler course.


Funnily enough, it's also where a bunch of mature compilers eventually ended up, because of speed, heuristics and so forth. Like GCC, which used to have a Bison-based parser IIRC, but is now hand-rolled recursive descent...


It's not so much for speed and heuristics but for error reporting and/or recovery.

As it turns out (and I've tried many times myself too), designing a parser generator that will retain a compact representation without losing good error reporting is very hard.

E.g. consider a BNF type grammar. Almost every symbol represents a potential error. But often deriving that error message automatically in a way that is helpful to a human is incredibly hard because the grammar lacks information about which mistakes are more likely for humans to make.

So we end up with bad error reporting, or you end up having to write code to analyse errors anyway, and then a lot of the benefits disappear - the actual parsing is usually easy to do compactly with a recursive descent parser and a tiny library of helpers anyway.
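A toy illustration of that last point (invented mini-language and names): a generator derived from the grammar can only report "unexpected token", whereas a hand-written parser can bake in knowledge of which mistakes humans actually make:

```python
def parse_let(toks):
    """Parse 'let NAME = EXPR' from a token list.
    Demonstrates special-casing a likely human mistake ('==' for '=')
    to produce a tailored error message."""
    if not toks or toks[0] != "let":
        raise SyntaxError("expected 'let'")
    if len(toks) < 2 or not toks[1].isidentifier():
        raise SyntaxError("expected a variable name after 'let'")
    if len(toks) < 3 or toks[2] != "=":
        # A grammar-derived parser would just say "unexpected token" here.
        if len(toks) >= 3 and toks[2] == "==":
            raise SyntaxError("use '=' for assignment, not '==' "
                              "(did you mean a comparison?)")
        raise SyntaxError(f"expected '=' after '{toks[1]}'")
    return ("let", toks[1], toks[3:])
```

The grammar alone can't tell you that '==' is the probable intent behind that error; that knowledge lives outside the formalism, which is exactly why the automatic error messages come out bad.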


Yeah, I was imprecisely using "heuristics" to mean primarily error reporting. You essentially have to accept a wider grammar than the real one, to be able to throw more intelligent errors.

But speed, in my experience, is also a major issue. GCC sped up a good bit when switching, and we are seeing speed issues with Bison-generated parsers.

Ironically, one of the benefits of Bison is that you get feedback about whether your grammar is conflict-free. That can be a significant advantage over a hand-built recursive-descent parser for a complex language like SQL.


Speed certainly can be a nice bonus, but you can also generate recursive descent parsers fairly easily if you have a suitable representation. My own parser generator attempts did that.

The feedback certainly is useful, and I'm all for people using such tools to validate grammars for testing. Though people manage to abuse that by writing extra code too (Ruby being a prime example of using a parser generator and then abusing the hell out of it in ways that complicates the grammar).

I really wish more work would go into research on automating error reporting, though. As much as I handwrite most of my parsers now, I wish I didn't have a reason to.


> I really wish more work would go into research on automating error reporting, though. As much as I handwrite most of my parsers now, I wish I didn't have a reason to.

Indeed.


I recently made myself sit down and learn leex (essentially lex for Erlang), and I found that to be an extremely useful exercise. It helped me understand how to think about what kinds of inputs a parser can make use of, and when it makes sense to skip the tokenizing step.



