
One thing that was interesting to me:

The outage was caused by a regex that ended up doing a lot of backtracking, causing PCRE, the regex engine, to essentially execute a runaway expression.

This reminded me of an HN post from a couple months back by the author of Google Code Search, explaining how it worked: https://swtch.com/~rsc/regexp/regexp4.html . Interestingly, he wrote his own regex engine, RE2, specifically because PCRE and others did not use real automata, and he needed a way to run arbitrary regex searches safely.




I think it's not uncommon. I've seen runaway regex backtracking in two places recently.

1. A test job in a CI/CD pipeline suddenly taking a very long time and a lot of CPU

2. A data cleansing / checking job in a Java webapp occasionally turning the machine to molasses.

In both cases the regex had been around for a while; what changed was the data, e.g. lots of trailing whitespace.
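That failure mode is easy to reproduce. A minimal sketch, assuming a backtracking engine like java.util.regex (the class name, pattern, and input size here are illustrative): \s+$ goes quadratic when a long whitespace run is not actually at the end of the input.

  import java.util.regex.Pattern;

  public class BacktrackDemo {
      public static void main(String[] args) {
          // 20,000 spaces followed by a non-space: \s+$ tries a match at every
          // starting position, runs up to the 'x', fails the $, and backs off
          // one character at a time -- O(n^2) work, slow enough to see on a
          // stopwatch, for an innocent-looking pattern.
          String input = " ".repeat(20_000) + "x";
          Pattern p = Pattern.compile("\\s+$");
          long t0 = System.nanoTime();
          boolean found = p.matcher(input).find(); // never matches here
          System.out.printf("found=%b in %d ms%n", found,
                  (System.nanoTime() - t0) / 1_000_000);
      }
  }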


Just to note, Go's regexp package implements RE2 syntax and guarantees linear-time matching[1].

[1]: https://github.com/google/re2/wiki/Syntax


Yes, this is one of the regexp engines the post discusses switching to as a mitigation.


The problem is that a deterministic regex engine (deterministic finite automaton, or DFA) is strictly less powerful than a non-deterministic one (NFA). DFAs can't backtrack, for example. In addition, DFAs can be quite a bit slower for certain inputs and matches.


Actually, it is proven that NFAs and DFAs are equally expressive. See https://en.wikipedia.org/wiki/Powerset_construction
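The construction itself is short. A minimal sketch in Java (the three-state NFA, for "second symbol from the end is an a", is a made-up toy example): each DFA state is the set of NFA states that could currently be live.

  import java.util.*;

  public class PowersetConstruction {
      public static void main(String[] args) {
          // Toy NFA: state 0 loops on a/b and guesses 'a' into state 1,
          // state 1 reads any symbol into the accepting state 2.
          List<Map<Character, Set<Integer>>> nfa = List.of(
                  Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
                  Map.of('a', Set.of(2), 'b', Set.of(2)),
                  Map.of());

          // Explore the subsets of NFA states reachable from {0}.
          Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
          Deque<Set<Integer>> work = new ArrayDeque<>();
          work.add(Set.of(0));
          while (!work.isEmpty()) {
              Set<Integer> s = work.poll();
              if (dfa.containsKey(s)) continue;
              Map<Character, Set<Integer>> trans = new HashMap<>();
              for (char c : new char[]{'a', 'b'}) {
                  Set<Integer> next = new TreeSet<>();
                  for (int q : s) next.addAll(nfa.get(q).getOrDefault(c, Set.of()));
                  trans.put(c, next);
                  work.add(next);
              }
              dfa.put(s, trans);
          }
          // 3 NFA states become 4 reachable DFA states here; the worst case
          // is 2^n DFA states for an n-state NFA.
          System.out.println("DFA states: " + dfa.size());
      }
  }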


"You are technically correct. The best kind of correct."

In theory, your statement is perfectly correct. However, quoting that reference:

"However, if the NFA has n states, the resulting DFA may have up to 2^n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs."

This means that in practice, DFAs can be much larger, slower to construct, and sometimes infeasible to build at all if the expression is complex enough.
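The standard witness for that 2^n blowup (a textbook example, not from the article) is the language of strings whose n-th symbol from the end is an a:

  L_n = \{\, w \in \{a,b\}^* : \text{the } n\text{-th symbol from the end of } w \text{ is } a \,\}

An NFA accepts L_n with n+1 states by guessing where the last n symbols begin, but a DFA must remember the last n symbols it has seen, forcing at least 2^n states.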

However, this was my mistake. I remembered (vaguely) the 2^n issue and didn't follow up to make sure I was accurate.

And I completely spaced on the fact that neither NFAs nor DFAs handle backreferences without extensions.


I don't know what you mean by "DFAs can't backtrack". Maybe you mean DFAs don't support backreferences, which is true, but NFAs don't support backreferences either.

I believe if r is the size of the regex and d is the size of the data, an NFA is O(r) to compile and O(rd) to execute, while a DFA is O(2^r) to compile and O(d) to execute. So DFAs are slower to compile, but faster to execute.
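The O(rd) bound comes from simulating all live NFA states in lockstep instead of backtracking. A minimal sketch (same toy NFA as in the powerset example above, "second symbol from the end is an a"):

  import java.util.*;

  public class NfaSimulation {
      public static void main(String[] args) {
          List<Map<Character, Set<Integer>>> nfa = List.of(
                  Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
                  Map.of('a', Set.of(2), 'b', Set.of(2)),
                  Map.of());
          String input = "abbaab";
          Set<Integer> cur = Set.of(0);            // start state set
          for (char c : input.toCharArray()) {     // d steps...
              Set<Integer> next = new HashSet<>(); // ...each touching at most r states
              for (int q : cur) next.addAll(nfa.get(q).getOrDefault(c, Set.of()));
              cur = next;
          }
          System.out.println(cur.contains(2));     // true: 2nd-from-last symbol is 'a'
      }
  }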


The problem is not DFA vs NFA.

"Regular expression" has a different meaning in the programming context than in the formal-language context. Regular expressions in regex libraries do more than match regular languages.

PCRE can also recognize all context-free languages and some subset of context-sensitive languages. Just having backreferences makes the matching problem NP-hard.
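A concrete look at the part that breaks the automata model, using java.util.regex (the doubled-word pattern is just an illustration):

  import java.util.regex.Pattern;

  public class BackrefDemo {
      public static void main(String[] args) {
          // \1 must re-match whatever group 1 captured at run time --
          // unbounded memory of the input, which no finite automaton has.
          Pattern doubledWord = Pattern.compile("\\b(\\w+) \\1\\b");
          System.out.println(doubledWord.matcher("the the cat").find()); // true
          System.out.println(doubledWord.matcher("the cat sat").find()); // false
      }
  }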


I thought NFAs and DFAs were equivalent, i.e. any NFA can be reduced to a DFA (at least that's what I remember from undergraduate theory of computation).


Perhaps a parser exists that can determine whether an input regex is prone to runaway backtracking, and can automatically switch to a deterministic algorithm?


Just check whether it uses backreferences; otherwise it can be implemented via an NFA/DFA.
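A rough sketch of that check, as a hypothetical heuristic rather than a real parser (named backreferences like \k<name> and corner cases around character classes would need more care):

  import java.util.regex.Pattern;

  public class BackrefCheck {
      // Find \1..\9 in the pattern source, skipping escaped backslashes.
      static boolean usesBackreference(String regexSource) {
          return Pattern.compile("(?<!\\\\)(?:\\\\\\\\)*\\\\[1-9]")
                        .matcher(regexSource).find();
      }

      public static void main(String[] args) {
          System.out.println(usesBackreference("(\\w+) \\1")); // true
          System.out.println(usesBackreference("\\\\1"));      // false: escaped backslash, literal 1
          System.out.println(usesBackreference("a+(b|c)*"));   // false
      }
  }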


You might be thinking of pushdown automata, where N-PDAs are strictly more powerful than D-PDAs.



