Hacker News new | past | comments | ask | show | jobs | submit login
Parsing regular expressions with recursive descent (might.net)
84 points by joeyespo on July 3, 2012 | hide | past | favorite | 16 comments



If you liked this you may also enjoy http://genius.cat-v.org/brian-kernighan/articles/beautiful (simple regex engine in C by Rob Pike, article written by Brian Kernighan).


His other post on matching using derivatives is very interesting: http://matt.might.net/articles/implementation-of-regular-exp...


See Might's talk at http://www.stanford.edu/class/ee380/ay1011.html .

Parsing with derivatives can take advantage of the Pratt hack to go even faster.


If you want to build a regular expressions parser as a learning exercise, you might want to look into automatons. Lots of easy to grasp (and best of all: visual) concepts to learn there.


Second that. If you are going to go down this route, I wrote a simple tutorial with lots of visuals to construct a regular expression from a finite state automaton using JFLAP ( jflap.org ), for the "inverse fizzbuzz" problem - http://www.jasq.org/2/post/2012/06/fbfbfbfbfffbfbffbffbfbffb...


This one's great as well for understanding RegEx FSM's: http://rise4fun.com/rex


Hey thanks! I fed the inverse fizzbuzz regex to Rex and it came up with an automaton with 81 states! http://www.jasq.org/uploads/1/0/5/6/10564637/6329774.png?694

My original automaton only has 8 states. ( http://www.jasq.org/uploads/1/0/5/6/10564637/977128_orig.png ) I guess that's the price you pay for cycles. ( Mine is a cyclic graph, but Rex produces a directed acyclic graph. Since they both must capture the same regex, one of them would be much larger than the other )


Russ Cox put the plan 9 grep's parser source on-line: http://swtch.com/usr/local/plan9/src/cmd/grep/grep.y

Hey, it's a yacc-compatible grammar.


I wrote a detailed set of articles for GameDev back at the time (2004) that goes the whole way - full regex matcher (for simple regexes) including NFA/DFA construction.

First two articles:

* http://eli.thegreenplace.net/files/docs/forays/col1.html

* http://eli.thegreenplace.net/files/docs/forays/col2.html

The rest listed here: http://eli.thegreenplace.net/articles/


This series of articles [1] by Russ Cox is also interesting for anyone interested in parsing regular expressions.

[1] http://swtch.com/~rsc/regexp/regexp1.html

See at the bottom of that articles for links to followups


Heh, came just at the right time... About to do a project on regexs. I've implemented a regex engine before, but never a the parser bit.


> Every programmer ought to have the experience of implementing a tool for matching regular expressions from scratch.

Umm... I'm sure it's fascinating, but I've really got better things to do with my time, which is why I use RegExs in the first place :)


You also probably have better things to do with your time than writing your own webserver, or creating your own language and compiler, or creating a database. But every time you do one of those things, you learn a lot about the things you actually use, and you end up being a far, far better programmer on the whole. If you only do the things you need to do, you're never going to be a great programmer.


Well of course, I couldn't agree more!

My point was that the original post seemed really judgmental in implying that building a RegEx engine ought to be done by everybody. Instead of "every programmer ought to have the experience", they could have said, "it can be valuable to have the experience".

But a blanket statement that every programmer should do X... is really annoying and shows a very narrow worldview. Programmers are diverse, and do lots of different things.


Wow, overreaction... there is a standard human communication idiom that is being followed here. For example movies are labeled as "must see", food is described as "you have to try...", and so on. Everyone[1] understands this to be a hyperbolic phrasing that isn't to be taken literally, but to really emphasize how important the speaker[2] finds the experience. Relax a little on the pedantry, it invalidates any statements made about better uses of one's time.

[1] The sentence that references this footnote uses the phenomenon it describes for emphasis.

[2] This is a similar, but distinct phenomenon of using a term, in this case speaker, with the implicit understanding that it can be generalized, in this case to "communicator". It shouldn't be confused with the situation described, but is also a useful pattern to understand for basic communication.


Yeah, but writing your own engine will help you understand why your innocent-looking expression has stupid exponential performance. (Assuming you implement something akin to Perl-style regexps that use backtracking rather than actually regular expressions using an NFA.)

It will also help you understand why engines like re2[1] are absurdly faster than normal engines.

[1]: http://code.google.com/p/re2/

In short: it's a great learning exercise. Also, I suggest learning the basic theory behind it (Chomsky hierarchy and so on)--it's very fascinating.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: