Hacker News new | past | comments | ask | show | jobs | submit login
Step 1) Install pyre2 2) import re2 as re 3) 10x faster regex (github.com/axiak)
97 points by axiak on July 23, 2010 | hide | past | favorite | 27 comments



If you're not familiar with re2, you may be familiar with a couple of other little projects that the author, Russ Cox, is involved with: Go (http://golang.org/) and Plan 9 (http://swtch.com/plan9port/).

Also, here's a great bit of technical history behind Russ' re2 implementation, and why pyre2 will never be completely compatible with Python's re (without fallback):

http://swtch.com/~rsc/regexp/regexp1.html

http://swtch.com/~rsc/regexp/regexp2.html

http://swtch.com/~rsc/regexp/regexp3.html


Google Code Search was his intern project: http://googlecode.blogspot.com/2006/10/search-worlds-public-...


What kind of licensing does this have?

It would be interesting to see if this could make it in Python 3.2 (It's supposed to include the Unladen Swallow stuff, so might as well take more awesomeness from Google.)



It won't. It doesn't implement the entire regular expression language Python supports.


Could a regex library parse a regular expression and decide reasonably quickly which engine to use?


Probably, but python-core has no interest in maintaining multiple regular expression implementations: http://mail.python.org/pipermail/python-dev/2010-July/101606... (this thread starts talking about regex, which is a backwards compatible enhancement to re, but also covers re2)


Yes. That's precisely what pyre2 is doing. It's trying re2, and if it fails because it doesn't support the features, it tries the python re. Since compilation happens rarely, this is fast.


Please note that RE2 doesn't support backtracking.


if you're using backtracking, maybe "regular expressions" aren't the right tool.


What would be the right tool, then?


Grammars are a good start. Mix in parse-controlling predicates if that isn't powerful enough (C++).


note it depends upon cython: i suppose it's more likely it will make it into cython rather than python 3.2.


Like the libxml module which uses Cython, the compiled CPP module is distributed along with it, so you don't actually need Cython to install it or compile it. But you would need Cython to do any development on it.


BSD (as does the google RE2 C++ module)


I tried using pyre2 and experienced runaway memory usage. I was using re.split. The behaviour went away when I switched to the standard re module. (without changing my code.)

Did anyone else experience this?

The code in question was:

(a, b) = re.split("\n\s*\n", text, maxsplit=1)

Thank you for any insights you may have regarding this.


Could this be uploaded to pypi, so that it is available via pip ?

pip install pyre2 -E /tmp/testing


If you read the readme under installation, you'll find that it is.

pip install re2 -E /tmp/testing


I'm curious. How much of a speedup in regex when using DFA instead of NFA? I believe most regex implementation use NFA but there are well known algorithms to convert NFA to DFA. It should be worth the try.


The problem is that in the worst case, converting an NFA to a DFA gives an overhead of 2^n, while keeping it as an NFA only incurs an overhead of m (where m is the regex length).

So, in the worst case (ie, under algorithmic complexity attacks), the NFA implementation is O(m * n), while the complexity of the DFA implementation is - I believe - O(2^m * n)


And fortunately, if re2 can not handle one of the regexen that you throw at it, then it will pass it down to Python's original re module (NFA_based). Perfectly seamless.


Both re2 and Python are NFA based. The difference, I believe, is that re2 doesn't support backtracking, which changes the worst-case time from linear growth to exponential growth on the input length. (For the snobs, that means that re2 regexes are regular expressions. Python's aren't.)


The difference is that re2 avoids backtracking in most cases where Python's normal regular expression engine would use it. And by avoiding backtracking, avoids the possibility of exponential slowdowns.


Isn't that what I just said?


No. "Doesn't support backtracking" suggests that it doesn't support features that re backtracks for. In fact it does.

Also it is technically possible for re2 to (although it doesn't currently do this) add backtracking to its current strategy. If it did this, then it could offer all features of re, but would frequently be much faster. However I would not permanently rule that possibility out.


I'm talking about the C++ re2 engine, not the Python wrapper.


So was I.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: