Unicode is a PITA. In Hyperscan, it's not pretty what gets generated for a bare ...

burntsushi · on March 30, 2019

Yeah, I've looked at glushkov based primarily on your comments about it, but Unicode is always where I get stuck. In my regex engine, Unicode is enabled by default and \w is fairly common, so it needs to be handled well.

And of course, one doesn't need to bet the farm on a lazy DFA if you have one, although it is quite robust in a large number of practical scenarios. (I think RE2 does bet the farm, to be fair.)

glangdale · on March 31, 2019

Unicode + UCP is a perfectly principled thing, but it wasn't a design point that made any sense for Hyperscan as a default. The bulk of our customers were not interested in turning 1 state for ASCII \w into 600 states for UCP \w unless it was free.

I think both Glushkov and Thompson can be done fast, but I agree that they are both going to be Really Big for UCP stuff. Idle discussions among the ex-Hyperscan folks generally leans towards 'NFA over codepoints' being the right way of doing things.

Occam's razor suggests if you do only 1 thing in a regex system (i.e. designing for simplicity/elegance, which would be an interesting change after Hyperscan) it must be NFA, as not all patterns determinize. If you are OK with a lazy DFA system that can be made to create a new state per byte of input (in the worst case) then I guess you can do that too.

I am not sure how to solve the problem of "NFA over codepoints", btw. Having no more than 256 distinct characters was easy, but even with remapping, the prospect of having to handle arbitrary Unicode is... unnerving.

burntsushi · on March 31, 2019

Yeah, my Thompson NFA uses codepoints for those reasons. But not in particularly smart way; mostly just to reduce space usage. It is indeed an annoying problem to deal with!