Hacker News new | past | comments | ask | show | jobs | submit login
Implementing Regular Expressions (swtch.com)
48 points by acqq on Nov 6, 2011 | hide | past | favorite | 9 comments



Here is an excellent and readable paper on implementing regular expressions http://sebfisch.github.com/haskell-regexp/regexp-play.pdf

They also extend it to weighted regular expressions and even lazy, infinite regular expressions.


The "Implementing Regular Expressions" page gives the full (scientific) background to

"re2 - An efficient, principled regular expression library" http://news.ycombinator.com/item?id=3204427


Also Russ Cox is one of the main hackers in the Google Go team and wrote the new (pure) Go regexp engine based on his work on re2.

Of course the Go team also includes Ken Thompson who popularized regexps with Unix, and Rob Pike of 'structural regular expressions' fame: http://doc.cat-v.org/bell_labs/structural_regexps/


Excellent links. And this was just on my mind in fact.

In most web-servers there is a list of regexes - typically starting stems, almost more glob-like - that are defined in precedence order to match requests to handlers.

Are there any libraries that can treat this as a single effective operation, rather than just testing against each regex as an island until there's a match?


Are you are interested in the parsing of the HTTP requests? I think I've recently seen some hand-coded state-based C implementation (I guess it can even be O(n)?) in some fast open-source HTTP server.


The last time I looked at nginx it parsed the HTTP request with a hand-coded state machine but routes that are defined by regular expression matching were handled one expression at a time.

Routes that are not defined by regular expressions used a much faster tree based method.


That's exactly what lex/flex and similar tools do, isn't it?

http://flex.sourceforge.net/


My understanding is that in lex/flex the expressions are known at compile time and c is generated from them.

For a web server the regular expressions are held in a configuration file and are not known until runtime.


Lex and Flex both generate C (Flex has an option to output C++ instead, IIRC), yes. There are similar tools for other languages though: Java has JLex, ANTLR, etc. for example.

Re: the config file - how often does it change? Does it really need to be deferred to runtime, or is it just startup time? My (probably out-of-date) understanding of rewrite rules in apache, for example, is that you need to bounce the server for it to pick up changes; that would also be sufficient for it to pick up changes to the flex file and rebuild. Running flex and compiling the output is quick enough to be instantaneous from the point of the admin restarting a server.

If it really can change at runtime, I guess the point is to avoid downtime while updating the configuration? There are ways around this too, some more practical than others. If your servers are behind a load balancer you can do a rolling restart to pick up changes. If that's not an option it's possible, though probably not trivial, to hot-patch the native code. YMMV. :-)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: