This is effectively just a nicer syntax for regular expressions than the usual one, but there are real areas of semantics where regexp languages are ripe for improvement - most regexp libraries are stuck in the 80s/90s, not keeping up with recent developments in research.
For example, most regexp languages basically only have the traditional three operators of alternation, concatenation and closure/iteration. It's not necessary to stop there: the regular languages are closed under various other useful operators. Many of these operators don't even significantly increase the size of the minimal DFA.
For example, the complement (as in set/language complement) and reversal/mirror image would be basic additions that seem like they should be standard by now. Using them can make a regular expression much shorter, nicer and more understandable. For the complement, this should come at no cost in minimal DFA size.
Some other viable operators are: intersection, set difference, merge (AKA shuffle), infiltration (also sometimes known as shuffle), interleaving.
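To give a concrete (and hypothetical - no mainstream engine exposes this syntax) sense of the shortening: with a complement operator, "any string not containing foo" could be written roughly as

    !( .* 'foo' .* )

whereas in today's PCRE-style syntax the usual workaround is the tempered-dot idiom

    ^(?:(?!foo).)*$

which is both longer and harder to read.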
Some other possibilities for regexps that could be put to better use are:
* weighted regular expressions: these add power in a very elegant way; the idea is that programmers don't want just recognizers, so why limit regexps or finite automata to that.
* JIT. It should be possible to compile regexps with libgccjit or with llvm.
> most regexp libraries are stuck in the 80s/90s, not keeping up with recent developments in research.
Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
For the small subset of regex engines that are based on finite state machines, it's pretty much intractable there too outside of niche regex engines.
In fact, the research[1] suggests that adding things like complement/intersection is quite difficult:
> In particular, we show that when constructing a regular expression defining the complement of a given regular expression, a double exponential size increase cannot be avoided. Similarly, when constructing a regular expression defining the intersection of a fixed and an arbitrary number of regular expressions, an exponential and double exponential size increase, respectively, cannot be avoided.
And indeed, as another commenter pointed out, "minimal DFA" is effectively irrelevant for any general purpose regex engine. Not only do you not have the budget to build a DFA, but you certainly don't have the budget to minimize that DFA.
With respect to reversal, RE2 and Rust's regex crate both do that. But mostly as an internal strategy for finding the start-of-match when using a lazy DFA. It's otherwise a somewhat niche feature more generally.
With respect to JITs, plenty of regex engines out there do that. PCRE comes to mind. So does V8.
As a general purpose regex engine author, we aren't "stuck" in the 80s/90s. There are just some fundamental trade offs at play here that make your ideas difficult to support. It's not like we haven't thought about it. Moreover, for things like complement and intersection specifically, actually reasoning about them in regex syntax is pretty tricky! I'm not sure if you've tried it or not. (There are some niche regex engines that implement it, like redgrep.)
>Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
>Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
Most regex engines, using backtracking, already have it in the form of lookarounds. The performance penalty of such a strategy is often quite significant. Lookarounds are hard to automatically convert into intersection and complement, but if such primitives were exposed to the programmer, they could often reimplement their regexes in terms of them. It might occasionally lead to an unacceptable performance cost, but there would be many cases where it would be a desirable tradeoff.
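As a concrete illustration (not from the comment above, just the common idiom): in a backtracking engine, intersection is typically simulated with lookahead, e.g.

    ^(?=.*[0-9])(?=.*[a-z]).*$

matches strings containing both a digit and a lowercase letter. Each lookahead rescans the input from the current position, though, which is where the performance penalty mentioned above comes from.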
> Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff.
> Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
Yeah, I wasn't really considering irregular "regexps"; they don't make much sense in theory, so of course extending them wouldn't make much sense either. Thing is, extending true regular expressions with operators like complement and concepts like weights seems like it could make the "theoretically pure" regexps more powerful *in practice* than irregular regexps (with backreferences, etc.), and certainly more understandable.
> For the small subset of regex engines that are based on finite state machines, it's pretty much intractable there too outside of niche regex engines.
Don't think so. There are many examples, but mostly implemented in "functional" programming languages.
> In fact, the research[1] suggests that adding things like complement/intersection is quite difficult:
> > ...
You misunderstood the abstract. It's not saying that it's difficult to translate extended RE (RE with these additional operators) to finite automata; it's just saying that translating extended RE to traditional RE can cause a huge blowup (something I already hinted at in the comment above). So this is a pro, not a con.
> And indeed, as another commenter pointed out, "minimal DFA" is effectively irrelevant for any general purpose regex engine. Not only do you not have the the budget to build a DFA, but you certainly don't have the budget to minimize that DFA.
It's possible to produce NFA directly from extended RE, see "Antimirov derivatives" for a start. The original Antimirov derivatives don't support complementation and similar, but there are extensions that do. Search for something like "partial derivatives regular expression complement", or "derived term automata complement".
> niche regex engines that implement it, like redgrep
> Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff.
Obviously, I have. There's a reason why there isn't a single general purpose regex engine that uses regex derivatives. They build full DFAs and full DFAs take worst case exponential time and space.
> Don't think so. There are many examples, but mostly implemented in "functional" programming languages.
Oh I think so. Feel free to link to some examples of general purpose regex engines that implement these things.
> You misunderstood the abstract. It's not saying that its difficult to translate extended RE (RE with these additional operators) to finite automata, it's just saying that translating extended RE to traditional RE can cause a huge blowup (something I already hinted at in the comment above). So this is a pro, not a con.
No, I'm not misunderstanding anything. The bottom line is that if you have an NFA with N states and you take its complement (how do you do that without first converting it to a DFA?), then you could wind up with an NFA containing a number of states exponential in N.
What's happening here is that you aren't quite appreciating what it means to engineer a general purpose regex engine.
All of these things you're talking about are all doable and implementable in niche regex engine libraries that serve a particular purpose. And it's all been done before. What I'm talking about here are general purpose regex engines, where you can't afford quadratic time during regex compilation (which is why Thompson's construction is so popular), nevermind exponential time.
And all of this is rooted in me taking objection to your characterization of regex engines being "stuck" in the 80s/90s. I'm trying to explain to you why that's not the case.
In theory, practice and theory are the same. In practice, they're totally different.
> It's possible to produce NFA directly from extended RE, see "Antimirov derivatives" for a start.
I know. I'm not talking about what's "possible." I'm talking about what's feasible from an engineering perspective. As far as I know, building such an NFA takes O(n^2) time, which is a hard sell in a general purpose regex engine.
There are tradeoffs, for sure; but say one creates an API with separate procedures for compilation and execution, I think one could call such an engine a "general purpose one" even with the compilation taking quadratic (or even exponential) time.
> The bottom line is that if you have an NFA with N states and you take its complement (how do you do that without first converting it to a DFA?), then you could wind up with an NFA containing a number of states exponential in N.
Are you sure that a less naive approach doesn't exist? Genuinely asking, I'm not an expert. I don't quite remember the contents, but I know there are some papers with extensions to Antimirov derivatives (see updated upthread comment) that support taking the complement of an expression.
> There are tradeoffs, for sure; but say one creates an API with separate procedures for compilation and execution, I think one could call such an engine a "general purpose one" even with the compilation taking quadratic (or even exponential) time.
I don't think so. There's a reason why exactly zero popular regex engines build full DFAs. (Sometimes small DFAs are built in cases where you know they'll be very small or enforce a tight bound, but even then, the DFA is typically used as a prefilter rather than a regex engine unto itself.)
There are plenty of regex engines that do build full DFAs, but they're either relegated to the work of research or niche areas. re2c, for example, is an excellent regex engine that translates regex syntax into DFA source code. It thus has procedures for compilation and execution, but it is not what I would call a "general purpose" regex engine.
As another example, if I were to make Rust's regex crate use quadratic (or far worse, exponential) compilation, I'm pretty sure I'd have many upset users. (Now, I could use quadratic or even exponential compilation in cases where I'm sure that the absolute wall clock time won't be disagreeable, but I think it's safe to exclude such things for the purposes of this discussion.)
Even for DFAs that don't take exponential time, Unicode means that DFAs get obscenely big. \pL, for example, is 279 states and uses 143K of memory. And it takes around 8ms to compile on my machine. 1ms is already pushing it.
> Are you sure that a less naive approach doesn't exist?
Pretty sure. I'm not an expert in the theory either. I'm just aware of most of it. And I'm totally unaware of anyone who has overcome the practical engineering challenges of things like complement and intersection.
One of the papers that I was referring to above is "Derived-term automata for extended weighted rational expressions" by Akim Demaille, from 2016. I haven't managed to study the topic more closely yet, but the idea seems to be to compute the DFA lazily, only creating the states and transitions when they are required. No idea how well this can perform in practice, but this seems like a valid way to avoid the exponential blowup, and it's already implemented in the Vcsn (formerly "Vaucanson", I think) software.
Yes, that's what RE2, the regex crate and a few other regex engines already do. (Specifically, build the DFA lazily.) Everything I've said I've said with that full context.
I don't know, maybe just don't make grand pronouncements about regex engines if you aren't actually familiar with how they're implemented and the engineering tradeoffs in play?
It seems pretty clear from reading the above thread that burntsushi is saying that they have not been able to find a way to implement the extensions you suggested without making the compile time unfeasible for general use. Given how prolific their work on regex and other text-search related libraries is, it seems reasonable to take them at their word that they have in fact spent time trying to solve this problem. On the other hand, it doesn't sound like there are any practical implementations that anyone can point to that add these extensions in a way that keeps compile time feasible. Given that, you seem to be making the stronger claim, so it's a little strange to put the burden of proof on the other side of the argument.
I already told you. I'm done wasting my time speaking with you in this thread. In particular, you seem unwilling to acknowledge any specific trade offs that I've pointed out.
Wait, if one wants to match a complement of a regex, surely it is equivalent to seeing if matching the original regex would fail? Or do I miss something obvious?
I mean, yes? But it's kind of missing the forest for the trees. Regex searching isn't limited to "did it match or not," but also to finding where the regex matched in the first place. For example, let's say you want to match all words except for 'foo'. You might write, '\b(!foo)\b', where the '(!re)' syntax means "complement of 're'." That would then match 'bar', 'baz' but not 'foo' in 'bar foo baz'.
It's like how character class set notation has a 'negation' feature. e.g., '[a-z]' matches a-z, but '[^a-z]' matches anything except for 'a-z'. So you might say, "why bother with adding negation when you can just check if the original char class would fail?" The answer is that without negation embedded in the regex itself, you miss out on composition.
This is another reason why features like complement and intersection aren't particularly popular. They are weird to reason about. Another reason is that many of their use cases can be achieved through other means. For example, my example above could conceivably be done by iterating over all words (using word segmentation, for example) and just checking whether each matches 'foo' or not. It could also likely be accomplished through the use of negative look-around, commonly found in backtracking regex engines and not finite automata based regex engines.
Your proposed implementation makes sense for the case of merely taking the complement of a simple regular expression. But the point of having the complement operator available is to be able to take complements of arbitrary sub-expressions within the RE, and even having nested complements in the RE. In that case your proposed implementation would be relatively inefficient.
I downvoted it because you were talking to burntsushi, who has implemented an excellent general-purpose regex engine and is very thorough, thoughtful, and knowledgeable about it. When he said "Can you show me the research", that was him being exceedingly generous to your point of view. And when you said "Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff", that was you being the opposite of generous.
Thanks for voicing your opinion. That said, I don't see constructive criticism here. From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
> who has implemented an excellent ...
This is an appeal to authority.
> being exceedingly generous to your point of view
Now you're just insulting me!?
> you being the opposite of generous
Consider that I couldn't have known upfront that burntsushi is acquainted to any degree with RE derivatives and similar concepts. His regexp engine is not based on them AFAIK. So I pointed to those for a start. Later in the discussion, including in the same comment, I pointed to other sources, too. So I think it's not fair to call me ungenerous.
The whole "general-purpose regex engine" thing was moving the goalposts in the first place and not very relevant to this whole discussion anyway. EDIT: I'm not saying that these distinctions don't matter at all, rather what I want to say is that it doesn't make sense to implicitly call upon a subjective and arbitrary standard.
> Thanks for voicing your opinion. That said, I don't see constructive criticism here. From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
I don't think it's so much "what you said" or even "who you said it to" but "how you said it". Most of us are coming to this thread specifically because we _don't_ already have specialized knowledge in this area, and we were hoping to learn something. It's totally reasonable not to have time to explain something this complicated, but it comes across as a bit disingenuous when followed by many additional comments in the thread continuing to debate in what looks to be close to real time. I think in this case, people are downvoting more for a breach of "decorum" than for anything else; you might very well be correct in your assertions about regexes, but it's impossible for any non-experts to tell, so we end up seeing (fairly or not) a picture of someone who is more interested in spending time arguing that they're right than actually demonstrating it.
> From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
Not so much for disagreeing. More for engaging in a lower-effort way (than the person you were talking to) and appearing to assume that burntsushi was coming from a place of ignorance.
> The whole "general-purpose regex engine" thing was moving the goalposts in the first place and not very relevant to this whole discussion anyway. EDIT: I'm not saying that these distinctions don't matter at all, rather what I want to say is that it doesn't make sense to implicitly call upon a subjective and arbitrary standard.
"general-purpose regex engine" is a direct response to "most regexp libraries are stuck in the 80s/90s". It's about why the regex libraries most people use (the general-purpose ones) haven't adopted the ideas that you're advocating. Seems on-point to me; on HN most readers will be mainly familiar with the general-purpose libraries, so they'll be thinking in terms of those.
I understand how you might think this. However, please consider that the statement you were responding to is establishing bona fides - a reason why burntsushi's words might be worth consideration.
Sometimes an appeal to authority is not wrong.
In my experience, the more I study a topic, the smarter other folks appear.
> * JIT. It should be possible to compile regexps with libgccjit or with llvm.
Although not JIT in that it needs the regex specified at compile time, both Rust and C++ (using the compile time reflex library) can generate the code to evaluate the regex at compile time.
The regex crate (for Rust) does not have this capability. It used to many years ago via the regex_macros crate, but I abandoned it mostly because of not having enough resources to dedicate to it and its somewhat niche usage. If you need compile time regexes, it's better to use something like re2c.
The reflex library is interesting, but more like a C++ template hack. Its codegen is impressive, but it's missing (last time I checked) a lot of the classical optimizations that a regex engine typically has. Like a literal prefilter scan.
> It used to many years ago via the regex_macros crate, but I abandoned it mostly because of not having enough resources to dedicate to it and its somewhat niche usage.
As an aside, my impression is that Rust developers are using macros less now than they were even a couple of years ago. It seemed a few years ago, a lot of people coming to Rust were impressed by the power of macros (especially procedural) and used them in a lot of libraries. It seems now, people are realizing the costs of macros to tooling and understandability, and are using them more judiciously.
Note that "minimal DFA" is not very relevant these days, with Google's `re2` being the only implementation I'm aware of that implements a DFA. All others sacrifice predictable performance for feature parity with Perl.
It's certainly not the only one. Rust's regex crate (of which I'm the author) and Go's regexp package both use finite automata.
It should be noted that none of them build out complete DFAs in the general case. Building a DFA is worst case exponential time and space, and they can in practice be expensive even when not hitting the worst case.
Instead, they use different techniques. RE2 and the regex crate, for example, use a hybrid NFA/DFA that does subset construction at search time. Outside of pathological cases, it provides the speed of a DFA without needing to build the whole thing up front.
Hyperscan is also built on finite automata. GNU grep uses finite automata in many cases too.
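For readers unfamiliar with the hybrid approach described above, here is a minimal sketch of the idea in Rust. This is an illustration of on-the-fly subset construction only, not RE2's or the regex crate's actual code: it uses a toy NFA without epsilon transitions, and keys the cache on whole state sets where a real engine would use compact state IDs.

    use std::collections::{BTreeSet, HashMap};

    // Toy byte-labelled NFA: transitions[state] maps a byte to its successor states.
    struct Nfa {
        transitions: Vec<HashMap<u8, Vec<usize>>>,
        accepting: BTreeSet<usize>,
        start: usize,
    }

    // Anchored "does the whole haystack match" test. DFA states (sets of NFA
    // states) and their transitions are built lazily and cached, so hot paths
    // run at DFA speed without ever constructing the full DFA up front.
    fn lazy_dfa_is_match(nfa: &Nfa, haystack: &[u8]) -> bool {
        let mut cache: HashMap<(BTreeSet<usize>, u8), BTreeSet<usize>> = HashMap::new();
        let mut current: BTreeSet<usize> = BTreeSet::from([nfa.start]);
        for &b in haystack {
            current = cache
                .entry((current.clone(), b))
                .or_insert_with_key(|(set, byte)| {
                    // One step of subset construction, done only on first use.
                    set.iter()
                        .flat_map(|&s| nfa.transitions[s].get(byte).into_iter().flatten())
                        .copied()
                        .collect()
                })
                .clone();
        }
        current.iter().any(|s| nfa.accepting.contains(s))
    }

    fn main() {
        // NFA for the literal "ab": 0 --a--> 1 --b--> 2 (accepting).
        let nfa = Nfa {
            transitions: vec![
                HashMap::from([(b'a', vec![1])]),
                HashMap::from([(b'b', vec![2])]),
                HashMap::new(),
            ],
            accepting: BTreeSet::from([2]),
            start: 0,
        };
        assert!(lazy_dfa_is_match(&nfa, b"ab"));
        assert!(!lazy_dfa_is_match(&nfa, b"aa"));
    }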
Not a single other implementation has feature-parity with Perl. Perl regexes are not only Turing-complete, but can also execute arbitrary code¹ as part of the regex evaluation.
PCRE is the pinnacle of deceptive advertising. It is emphatically not Perl-Compatible.
Source: I used to write Perl for a living, and I have battle scars from (¹) above.
When Sun released Java and went on a marketing road show, I drove to Chicago to hear the presentation. Since Perl was my favorite language, I had to ask about regular expression support; the Sun engineer said there wasn’t any, but I could build the first library.
To me, it was inconceivable that you could release a major language and not only not have regular expressions built into the language like Perl, but not even have it in the standard library. Crazy.
When I found out that Jonathan Locke had even a minimal Java implementation of a regex engine, I jumped on it and messaged him to open source it to the Jakarta project. Crazy that that was 20 years ago...
We did the same thing for Rust. Regexes are neither in the language nor in the standard library. Go doesn't have them in the language either (although they're in the standard library).
Sorry, early morning. I should have been more verbose.
Rust has a different target problem space, a different standard library philosophy, a package management system, and exists in a world where adding external dependencies is trivial.
It can, but this isn’t a hard timeout. From the description (“The MatchTimeout property defines the approximate maximum time interval for a Regex instance to execute”), it seems they have their regex engine periodically (say after handling every 10,000 instructions of the compiled regex) check whether they went over the limit.
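A minimal sketch of my reading of that description (the type and numbers here are made up; this is not .NET's actual code): the engine counts simulated instructions and only consults the clock every N of them, so the limit is approximate rather than hard.

    use std::time::{Duration, Instant};

    // Approximate deadline: the clock is checked only every CHECK_EVERY ticks.
    struct ApproxDeadline {
        start: Instant,
        limit: Duration,
        ticks: u64,
    }

    impl ApproxDeadline {
        const CHECK_EVERY: u64 = 10_000;

        fn new(limit: Duration) -> Self {
            ApproxDeadline { start: Instant::now(), limit, ticks: 0 }
        }

        // Call once per executed regex instruction.
        fn tick(&mut self) -> Result<(), &'static str> {
            self.ticks += 1;
            if self.ticks % Self::CHECK_EVERY == 0 && self.start.elapsed() > self.limit {
                return Err("regex match timed out");
            }
            Ok(())
        }
    }

    fn main() {
        let mut deadline = ApproxDeadline::new(Duration::from_millis(1));
        // Stand-in for the regex engine's instruction loop.
        loop {
            if deadline.tick().is_err() {
                println!("gave up after ~{} instructions", deadline.ticks);
                break;
            }
        }
    }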
There’s also the postgres regex engine, I don’t know how it works under the hood but it’s very resistant to catastrophic backtracking despite features which normally translate to nfa (e.g. backrefs).
> features which normally translate to nfa (e.g. backrefs)
You're very wrong: backreferences in general can't be translated to finite automata (NFA is short for nondeterministic finite automaton). See these comments:
On a related note, if you have Python regex code that you want to make more stable/performant, https://pypi.org/project/pyre2/ is a drop-in replacement for `re` that (configurably) falls back to `re` if you use lookaheads, etc.
Various regexp implementation possibilities do exist, whether in software or in hardware. I think my general point stands independently of which regexp implementation is used (although the comparisons get hairy if you include the irregular expression "regexps").
Yes, and straightforwardly so if you use character classes as your basic building blocks. Here's a Haskell implementation I wrote that is easily extendable to include complements: https://github.com/dan-blank/hgrep-smallcore (I like this project because it translates ERE-compliant regexes - sans negated character sets - down to only 4 constructs, one of which is character classes). It implements https://www.ccs.neu.edu/home/turon/re-deriv.pdf; character classes are described in 4.2.
I actually had complement in it as a 5th construct, but when the submission came closer and the examiners found some errors in my logic (my fault for not writing good enough unit tests!), I took complement out again when cleaning the project up.
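For readers who don't want to dig through the paper: here's a minimal Brzozowski-derivative matcher sketched in Rust (not the Haskell project above, and without the character classes, smart constructors and memoization a real implementation needs), mainly to show how cheaply complement slots in.

    // A tiny regex AST. `Not` is language complement.
    #[derive(Clone)]
    enum Re {
        Empty,                 // matches no string
        Eps,                   // matches only the empty string
        Chr(char),
        Alt(Box<Re>, Box<Re>),
        Cat(Box<Re>, Box<Re>),
        Star(Box<Re>),
        Not(Box<Re>),
    }

    // Does the language of `r` contain the empty string?
    fn nullable(r: &Re) -> bool {
        match r {
            Re::Empty | Re::Chr(_) => false,
            Re::Eps | Re::Star(_) => true,
            Re::Alt(a, b) => nullable(a) || nullable(b),
            Re::Cat(a, b) => nullable(a) && nullable(b),
            Re::Not(a) => !nullable(a),
        }
    }

    // Brzozowski derivative of `r` with respect to `c`: the regex matching the
    // suffixes that complete a match of `r` after reading `c`. Note that the
    // complement case is a one-liner.
    fn deriv(r: &Re, c: char) -> Re {
        match r {
            Re::Empty | Re::Eps => Re::Empty,
            Re::Chr(d) => if *d == c { Re::Eps } else { Re::Empty },
            Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
            Re::Cat(a, b) => {
                let step_a = Re::Cat(Box::new(deriv(a, c)), b.clone());
                if nullable(a) {
                    Re::Alt(Box::new(step_a), Box::new(deriv(b, c)))
                } else {
                    step_a
                }
            }
            Re::Star(a) => Re::Cat(Box::new(deriv(a, c)), Box::new(Re::Star(a.clone()))),
            Re::Not(a) => Re::Not(Box::new(deriv(a, c))),
        }
    }

    fn matches(r: &Re, input: &str) -> bool {
        nullable(&input.chars().fold(r.clone(), |r, c| deriv(&r, c)))
    }

    fn main() {
        // !(ab): everything except the exact string "ab".
        let ab = Re::Cat(Box::new(Re::Chr('a')), Box::new(Re::Chr('b')));
        let not_ab = Re::Not(Box::new(ab));
        assert!(matches(&not_ab, "aa"));
        assert!(!matches(&not_ab, "ab"));
    }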
I tried to test your program because I'm pretty sure your techniques can't be used in a general purpose regex engine. (I've long wanted to make use of regex derivatives somehow, but I don't think it's feasible because of the downsides of building up a full DFA.) More to the point, I also suspected that you might be using a sparse representation for transitions in your DFA if the alphabet of your DFA is indeed all of Unicode. This is also problematic because it tends to make search time quite a bit slower.
In any case, I built your program with 'stack install' but got this when I tried to use it with Unicode:
I think you should probably cut out the ERE thing entirely. Or maybe say it's "inspired" by ERE. Otherwise, it's not really that close at all. Notice that not even 'hgrep-exe '[a-z]' /tmp/b' worked for me either. That's certainly in EREs. So are things like '[[:word:]]', which also doesn't seem to work. There's also collating symbols and equivalence classes and some other junk.
But upon looking at the POSIX ERE spec, yes, it looks like technically things like \w, \d and \s are not in it. But most ERE implementations, including both BSD grep and GNU grep, support constructs like \w. (Yet another reason to cast a suspicious eye on folks who obsess about portability. Portability means following a spec, not using whatever your tool lets you do. And most tools let you do far more than what's in the spec because the spec---especially one like POSIX---is usually divorced from the reality of what's useful.)
I understand your tool is a university project. The main point I'm trying to drive home here is that there are folks in this thread that seem not to be keen on acknowledging engineering challenges and are instead only looking at the theory. Unicode, for example, is an enormous engineering challenge. It's not difficult because getting it correct is difficult, it's difficult because making it correct and fast is not straight-forward. As you show, using a naive representation (sparse transitions, hash sets for states) will give you correctness without much complexity. But that's not usually what we mean when we're talking about supporting Unicode in general purpose regex engines. Because "general purpose" means folks expect it to be minimally fast.
Ah! I was only referring to ERE as specced by POSIX, yes.
> It's not difficult because getting it correct is difficult, it's difficult because making it correct and fast is not straight-forward.
It seems to me that you take my project as a proxy for whether regex derivatives are a feasible way to deal with Unicode-ready regexes that also support complements and so on. That was not my intention; I was merely attempting to show an easy implementation of regex derivatives that can deal with Unicode and can be extended to support complements. With this project and the paper I linked, it should be rather straightforward to answer whether this particular way of constructing regex DFAs can achieve what you are looking for.
(To my last knowledge, a regex complement is not easy to add in the presence of extra features like backtracking and lookahead.)
> With this project and the paper I linked, it should be rather straightforward to answer whether this particular way of constructing regex DFAs can achieve what you are looking for.
Yes, you're correct. It just isn't that interesting because it will fall over in any kind of real practical usage. :-) Full disclosure, I'm the author of ripgrep, so I have some particularly relevant experience in the specific domain of making regexes work well for the masses. (It also forms my bias with what I care about.)
Like I said, I wasn't necessarily trying to pick on you. There are others in this thread that are seemingly ignoring the engineering challenges, and only looking at the theory. Then using that as a basis to proclaim that regex engines are "stuck" in the 80s/90s.
Consider my perspective here: if someone pipes up and says, "yeah hey it is actually easy to add those kinds of features to regex engines." Well, then, it's important to include the caveats with that. That's what I see as my role in this thread. Otherwise, it looks like general purpose regex engines are crippled for no apparent reason.
Oh well, first off, thank you for writing ripgrep! Great work, love that tool and use it daily :)
All good, thank you for sharing your POV! I find both the theory and the practical engineering feats quite interesting, and actually understood the original question as a theoretical question, not a question about general-purpose regex engines. Quite possibly that's even how the person asking intended it, in hindsight.
Last thing on topic (talking now about having complement and Unicode in a usable regex engine): my personal, gut-feeling-based hunch is that by adding all these features not strictly related to regular languages, we have kind of evolved our regex engines, and our expectations towards them, into a corner of the design space that works well, but might not work well once one adds complement. It feels like a constructivist world without negation, with stuff on top that works, and reverse-engineering what works with negation and what doesn't will likely be hard work. As a theory-ish kind of person (without much insight into regexes or formal grammars), I wish there were some kind of insight into what the atoms of a regex are and which atoms can be combined with which other atoms - I would bet that fancy features and Unicode+complement would have a high algorithmic complexity already in theory.
Any final tips so the repo I linked doesn't disappoint users' expectations? :)
I think you got it right by putting "university project" in the description. :-) I wrote my fair share of projects like that too.
The incentive structure of academia that leads to software like that, though, is also a big reason why I left. The theory is super important (that's why I work on a regex engine that only provides support for regular languages), but I also think the engineering aspect is just as important. And academia, or at least the corner I was in, just did not care about the engineering side of things at all. This in turn leads to publishing results that are hard to reproduce, because if you don't share code that is at least somewhat robust, it's going to be very costly for someone else to build on your work.
Anyway, sorry about the side rant haha. But all good.
All good! And sorry for not making the effort to understand your position first, I want to work on that.
> The incentive structure of academia that leads to software like that, though, is also a big reason why I left. The theory is super important (that's why I work on a regex engine that only provides support for regular languages), but I also think the engineering aspect is just as important. And academia, or at least the corner I was in, just did not care about the engineering side of things at all. This in turn leads to publishing results that are hard to reproduce, because if you don't share code that is at least somewhat robust, it's going to be very costly for someone else to build on your work.
Totally get it. This also baffles me - papers in computer science that are hard to reproduce? Yes I know, publish or perish and all that BS. But I don't want to accept this: I mean I get it for psychology obviously, and then also material sciences and so on, but for compsci?... Next to proofs, programs are probably one of the most reproducible things that nature has given us.
Good for me and many others that you cho(o)se to spend your time on building tools :)
> Full disclosure, I'm the author of ripgrep, so I have some particularly relevant experience in the specific domain of making regexes work well for the masses.
> it looks like technically things like \w, \d and \s are not in [POSIX]
The relevant standard is UTS #18; it subsumes POSIX afaict. Do you think the same as me about it, namely that following and implementing it is essential?
We were talking about EREs, which are an artifact of POSIX, not UTS#18. So the relevant standard for this specific conversation is POSIX.
To redirect to UTS#18, I don't think UTS#18 subsumes POSIX. UTS#18 doesn't support [[=a=]] for example AFAIK. And UTS#18 more generally doesn't require locale support. UTS#18 Level 3 was actually removed from the spec, which is where "custom tailored" logic for specific locales used to live. On top of that, POSIX also specifies BREs which UTS#18 doesn't touch. So while there is overlap between POSIX and UTS#18, POSIX is not a strict subset of UTS#18. If you're speaking "conceptually" and less precisely, you can maybe say POSIX is subsumed by UTS#18 though. I don't really think about it that way though personally. They serve two different use cases that are still relevant today.
Mostly irrelevant, but those runtime errors are the main reason why I stopped using Haskell. Haskell is a great language with a great booming ecosystem, but if you can have uncaught exceptions in IO monads, and even unchecked errors in your pure code like that fromJust there, then what's the point? For all its ceremony, it's not much more functionally pure than JavaScript. And a language without purity as a concept, like Rust, has frankly achieved a tighter feeling of safety and great flexibility despite having a simpler type system. IMO Haskell has to drastically change like it did in the early 2000s to become competitive again.
I'm unconvinced by such things. From what I can tell, this is academia code. In academia code, regardless of programming language, folks often don't care about failure modes. (I speak from experience, as someone who used to be in academia.) So I'd be more inclined to pin it on academia's incentive structure than anything about programming languages.
I suppose you could use a programming environment that forbids partial functions, but I'm unclear on how productive that is.
> And a language without purity as a concept, like Rust, has frankly achieved a tighter feeling of safety and great flexibility despite having a simpler type system.
Haskell has 'fromJust' and Rust has 'unwrap'. They're both exactly equivalent and result in similarly poor failure modes. (Well, 'unwrap' usually at least gives you a line number.)
Yeah, but even better, Rust has `expect`; that's beside the point, though. Rust doesn't pretend to be a pure language, it just promises to be a safe language as long as you don't use `unsafe`. Haskell makes a big fuss about how it's pure and productive because you have fewer bugs. And this is true as long as you're writing pure data processing or algorithmic code. But as soon as you stray from that and try to build a practical program, you'll need the IO monad, and it sucks; even Java deals better with exceptions than Haskell. All the type class expressiveness in the world, and they couldn't figure out how to model exceptions. And it used to be that Haskell could just introduce backwards-incompatible changes because it was a research language. But now Haskell is used widely in production, and even though it's in desperate need of a new Prelude, it's not gonna happen.
When someone who truly believed that Haskell could be a proper modern programming tool invested their hard work and frankly their genius into it, the Haskell leadership responded with skepticism and inaction. I bet if you gave Snoyman full dictatorship over a new Haskell prelude, it would be the best general purpose programming language by a mile for the next decade at least. Instead Haskell will remain as backwards as its ISO 8859-1 String type.
Yeah I think you're going places I don't care to follow. We were talking about a specific program, and I made the argument that the behavior of the program was more sharply determined by the incentive structure in which it was written. But you're kind of doubling down into a more general rant about Haskell that is actually totally divorced from the failure mode of the Haskell program in question.
I used to use Haskell too and I stopped using it for $reasons. Nothing to do with the prelude or its string type though. More about the nature of the language and paradigm itself.
Hold on! See: https://github.com/dan-blank/hgrep-smallcore/issues/1 Using `fromJust` is deeply frowned upon, and indeed a few eyebrows were raised when showing my code to the examiners. This and a handful of other warts are unfortunate, but few and easy to work around.
This function and others are in the prelude, but one can use other preludes that don't have these escape hatches.
This does not undermine the huge benefits that people get from having IO checked by the typesystem. Just as having tyepcasting in a language does not undermine the benefits of types in that language.
Nah, the type system is perfectly fine in this case: "Given (Just someValue), return someValue". Exactly what happened here; the precondition just was not fulfilled, and I left the case dealing with the unfulfilled precondition undefined. If one cares about such things, one should simply use one of the numerous alternative Preludes out there (to make sure to actually never use fromJust, head and the 3 other functions that nobody uses from the standard Prelude) and turn the warnings for incomplete pattern matches on. At least, that's how you could deal with it in real life.
If it is not practical matters that are the concern here, there are more than enough languages that only allow total functions and offer other sweet stuff. Haskell's main purpose was to be a lazy-by-default language, and that people can actually write practical stuff in it is a nice side effect.
Of course, Unicode is just another alphabet. Specific expressions might be problematic though, depending on the implementation.
EDIT: If malicious inputs are something to be worried about, I think a good way to do this would be to check for exceedingly costly blowups of various kinds during/after translation to a finite automaton (so before automaton execution). This would enable having complemented expressions when safe (in lots of cases, at least).
To get something useful and not attackable, the complement operation would need to take the complement inside the set of valid UTF-8 strings rather than the set of all byte-sequences, which is probably doable but I'm not sure how straightforward it would be.
I'm curious whether this concretely improves regular expressions for the average programmer or not. Overall, to me it seems like the classic minor-but-not-radical improvement.
The major difference I see is that this is a sort of `/x` modified regex with better readability. This is a significant improvement - the big problem with `/x` is that one then needs to escape/encode whitespace, which typically offsets the improvement.
The other significant change is variables; however, once one reaches a certain complexity, IMO they should not use regexes anymore.
The other differences (based on the home page presentation) are cosmetic changes of the same semantics.
The fundamental assumption is that the main problem of regexes is about syntax rather than semantics. It's true that regexes are somewhat ugly and hard to remember, but, personally, I don't think that's the main problem. It would probably have been nice if regexes had started with this syntax, but integrating it into a new project is a difficult decision.
Are you honestly implying that there are still people who, in all seriousness, use a REGEX to parse HTML?
Every programmer I have met, however junior, knows or learns about Zalgo early on. It's like the "foobar" wrt regexps: common knowledge.
True, knowing it's a bad idea won't hold people from doing it. I do it. I'll whip up some stupid `:%s/\<?p class="strong"?>/<strong>/g` occasionally in vim, knowing perfectly fine that this regexp won't be solid, but: works for this case. All the while knowing about Zalgo for years (decades?)
Yes, but the specific problem is that it's not general knowledge exactly why you shouldn't attempt it.
Ask a handful of developers and see how many mention Chomsky Type 3 grammar, or even regular grammar.
And unless you scope an understanding of what regexes are to that, which is non-trivial, you get some bizarre feature requests, because people are attempting to use it inappropriately.
From a quick glance Rulex looks very similar to Eggex!
A difference is that Eggex is embedded in a shell so you can use normal assignment statements to build up subpatterns. And you can also interpolate directly into an 'egrep' or 'awk' command.
Funny thing is that I designed a similar project called "Annex" in Python around 2014.
But when I looked back on it after a few years with fresh eyes, it was just too different! It was hard to remember.
So Eggex is very similar to the common syntax -- it's actually roughly the same as lex or re2c. I think there's a case to make that we ALREADY have a good syntax for regexes in lexer generators -- it's just the POSIX/Perl syntax that is bad!
The other thing about Annex is that it was a Python library, and it was never worth the extra dependency. (IMO PyPI has a lot of friction.)
That is, after writing it, I already knew stdlib re syntax so well it wasn't worth it :) But Eggex is in a shell, which is a context that makes a lot of sense IMO.
The first one runs code in the middle of the regex to see if it matches a mathematical property; the second runs code to generate a [0|1|2|3|...|254|255] equivalent, which also matches as tightly as the rulex/regex code (e.g. no 00 allowed)
None of those seem much easier to understand than the others. I suspect this is a classic example of "let's take something that is really messy in PCRE and compare it to other, briefer equivalents", but I could easily turn the regex into something more readable, and I can find more help for the standard regex syntax than I can for any alternative.
As others have said, it is a minor, arguably cosmetic improvement, and it's likely easier just to use a visual helper like regex101 to help you construct the expression.
It can end up taking a little longer to write (same with all of these alternate syntaxes), but I think it significantly reduces the WTF-factor when you have to try and read it again in weeks/months/years.
I think it's worth it for any non-trivial regexp and I miss it whenever I'm using any other language, so I'm glad that this thread is full of alternatives :)
Not OP, but I have switched to using the rx macro for everything and I never miss regex; instead I miss rx everywhere else. Here are a couple of blog posts someone might find useful:
I haven't used rx in Elisp specifically, but in my work in R, I've started using the rex package almost exclusively for writing regexes. It's the same idea: you build a regex programmatically instead of as a string, so you no longer need to think about syntax, only semantics, because the syntax is the same as the host language.
Just looked at Rulex briefly... am I the only one that thinks it isn't obviously more readable or easy to understand? Maybe it's because I've taken the time to master regex, but this syntax seems just as opaque as terse regex syntax was to me prior to digging in.
You are not the only one. REs look like gibberish until you understand them. To me Rulex looks like gibberish, but the REs they turn into make sense. I don't think the syntax is inherently more readable, and the whitespace insensitivity + comments feature has been in Perl and Ruby REs for decades.
I think the main appeal is its ability to transpile to different languages, so that you can use whitespace insensitivity + comments in other languages where those features are not supported.
I really like this. One big benefit of this notation, that I feel is under-highlighted in both this comment page and the Rulex site, is that there are no backslashes in Rulex.
A lot of regex usage is inside strings inside another programming language, which typically means double backslash escape messes such as \\\\" and it's just a mess. This makes regexes particularly beginner-unfriendly, because you gotta parse it out at different levels.
No backslashes means that you can put a Rulex expression in a string in just about any language, not escape anything, and still get a predictable result.
I suspect that a lot of its design actually follows from that requirement. Eg having to quote all character literals/sequences - that's needed because that way, eg a newline can just be "[n]", even inside a character class. (eg ['a'-'z' n]).
Also, it looks like you can choose to use either single-quotes or double-quotes. This means that if you use the one style of quotes in your language, you can use the other one in the Rulex expression and still not have to backslash-escape anything.
This has an almost uncanny-valley effect, in that yes, this is an improved syntax for regular expression, but that has the effect of highlighting why PEGs are the way forward here.
It's great that you can assign variables in Rulex, but they can't reference themselves recursively, so you're still locked out of all the interesting formats. Meanwhile it weakens the "ok but everyone knows how to use regex, it's in the language, and we're just doing a bit of data munging" argument for choosing regex, and you know what? That's a good argument, and it might be the only one.
Finally someone else trying to fix regex syntax approaches it from the same angle as mine[1]: a portable syntax that transpiles to plain old regex to get bug-for-bug compatibility but a nicer syntax.
Instinctively, I think the syntax didn't go far enough and could be made more readable (of course I'd feel that way!), but the implementation and marketing has me beaten! Kudos!
1. if you know regex for one particular language, this takes care of having to learn the different variants of regex, because it will compile down to the version you want.
2. it is a more approachable syntax, so easier to learn.
3. because it is still a regex syntax given that you probably know a bit of regex already you can get better at it by learning this.
Note that rulex replaces the named backreference with a numeric one. Named backreferences are much easier to work with, because you don't have to count capturing groups, which is error-prone and complicates later modifications.
A more accurate translation of the above into a (PCRE) regex would be
(\s)(?P<named>.)\k<named>
However, named backreferences aren't supported in JavaScript. Rulex makes sure that the emitted regex is correct and supported by the target regex flavor.
After looking at all the examples nothing stands out as particularly appealing. It ends up being just a more verbose version of regular expressions, but with less clear semantics, harder to see the underlying constructs. Over the years I've become quite familiar with regexp so maybe I'm just biased, but I'd rather have something like CoffeeScript's block expressions instead, where you can easily group and document each part:
> In Rulex, whitespace is insignificant, except between quotes. This means that we can add spaces and line breaks to make the code look clearer. We can also add comments to explain what the expressions are doing. They start with a # and span until the end of the line:
Ok, despite being more in the "now you have two problems" camp when discussing regular expressions, I find this interesting enough that I should schedule learning it. Question - when compiling, I suppose it can turn the regex into the most optimized form - so generation of character classes, and lazy quantifiers if pertinent?
// A regex for extracting a currency (dollars or pounds) and amount from input
// with precisely the form /[$£]\d+\.\d{2}/
let regex = Regex {
Capture { /[$£]/ }
TryCapture {
/\d+/
"."
/\d{2}/
} transform: {
Amount(twoDecimalPlaces: $0)
}
}
That gives you control over the verbosity.
Aside: all of this is nice, but does anybody know whether compilation speed (which can be notoriously bad) improved in the latest Swift beta?
The design of the examples is weird and unintuitive. At the very least should have some sort of clarifying header of “new” and “old” or something. It took me a solid 30 seconds of confusion to realize the regex was the old syntax rather than the new syntax.
The arrow points from left to right, and arrows represent progress but that’s backward from their intent. The new syntax should be on the end of the arrow, not the beginning.
If this were a regex transcoder and the base were the input syntax and the tip were the output code it could kinda make sense, but best I can tell that is not the case here.
The only part that I can agree to is that it's useful to compose a larger regex from smaller easier-to-understand ones.
But I don't see why a new regex engine is required for that. On the other hand I'm probably overlooking some important features that they bring to the table. Maybe I'll come back with more comments if I invest more time looking at Rulex.
Could a way to list examples be added to this? I guess it might be possible just with comments and a preprocessor, but it would be great to have a way to specify a list of matches and non-matches and know that the regex works with them at 'compile' time. It would probably work as inline documentation too.
And is there a standard file extension and syntax colouring available for editors (yet)?
When compiling a rulex to a regex, you specify the regex flavor, so the resulting regex is compatible with that regex flavor. This is significant because PCRE, .NET, Ruby, Python, JavaScript, etc. all use slightly different syntax, but you can use the same rulex to target all these different regex engines.
Looking at the examples, I can't see where the improvement is, tbh. Both classical regex and this new kid equally require training to read, because they are both terse. Adding whitespace and comments doesn't bring novelty - using your PL of choice's syntax you can already split a regex into smaller chunks for readability.
- Regular Expressions - very popular, occasionally reads like line noise, backslash for escape
(?:What is your (?:name|quest|favourite colour)\?\s?){1,3}
- https://github.com/SonOfLilit/kleenexp - Terse, readable, almost never needs escaping. Python compiler, almost ready VSCode extension, online playground.
[1-3 'What is your ' ['name' | 'quest' | 'favourite colour'] '?' [0-1 #space]]
- https://github.com/yoav-lavi/melody - More verbose, supports macros, backslash escapes only for quotes. Rust compiler, babel plugin. Improves with time, getting quite impressive.
1 to 3 of match {
"What is your ";
either {
"name";
"quest";
"favorite color";
}
"?";
0 to 1 of " ";
}
- https://rulex-rs.github.io/ - Very similar to legacy regex syntax, supports macros and number ranges, supports unicode, _amazing_ error messages help convert legacy to new syntax, backslash escapes only for quotes. Rust compiler, as of today no built in way to use outside rust (but they seem to be planning it).
('What is your ' ('name'|'quest'|'favorite colour')'?' [s]){1,3}
- https://www.oilshell.org/release/latest/doc/eggex.html Part of a new Unix shell's syntax. Big on composition (macros in kleenexp). Uses backslash for character classes. Production-ready within the shell, not supported elsewhere yet.
/ 'What is your ' ('name' | 'quest' | 'favorite color') '?' ' '? /
- http://verbalexpressions.github.io/ - Embedded DSL, supports 14(!) languages (to some extent? I didn't verify), but don't seem to have syntax for `(a|b)` and `(?:...){3}`
const tester = VerEx()
.then('What is your ')
.either( // this doesn't seem to be implemented yet (?), so I'm proposing a possible syntax
VerEx().then('name'),
VerEx().then('quest'),
VerEx().then('favorite color'),
)
.then('?')
.maybe(' ')
.multiple(3); // not sure this is the correct syntax or how to use it in more complex scenarios, hard to tell from tests and discussions in Issues
- There are many more eDSLs, but I will not list them as they are less relevant in my opinion
If you want to see something really neat check out Prolog DCGs:
> A Prolog definite clause grammar (DCG) describes a sequence. Operationally, DCGs can be used to parse, generate, complete and check sequences manifested as lists.
This project looks great.
Occasionally I need to write a reg expr (JS developer) and I am not that good at it and I always end up googling around.
I am thinking it would be much nicer if I could use rulex to generate my reg expr.
But is that what it is supposed to be for ?
cheers
Yes, rulex is compiled to a regex. You can use it with the CLI or the online playground and insert the result into your source code. Eventually, I'd like to have a babel plugin that can transform a rulex`...` literal to the equivalent regex, but we're not there yet.
Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one. Regex is fine. Been using it nearly daily for the last decade. You get used to it. These days, there are tools that make it incredibly easy to write, test, version and document regex, like https://regex101.com/.
It's all too clear that they have mastered the regular expression syntax, because they explain their own in terms of it. And I for one have a hard time reading it.
Regex is not fine. It's been recognized as a source of bugs for decades. Fredrik Lundh's famous quote: "Some people, when confronted with a problem, think ''I know, I'll use regular expressions!''. Now they have two problems."
Claiming that nothing needs fixing here is a lot like claiming we don't need Rust because the C preprocessor does all we need, or that we don't need any fancy build tools because we have GNU autotools. I hardly see that sort of comment anymore anywhere but Slashdot.
The issue is that ideally you use regex as little as possible anyway, so it's not worth the effort of fixing it and pushing for something better.
The main problem with this proposal is that they use almost no time explaining why they've made the changes they made, and even to understand WHAT the changes are you have to be good at reading already existing regexes (which is harder than writing them!)
That said, one change I spot which I heartily approve of, is reversing the quoting assumption. In regular regex, strings by default match themselves, and need to be escaped to get special meanings. This is a poor design because if you reach for regexes in the first place, it's the special meanings which are important, not the literal matches. Even the first change to regex (egrep) partially recognized this, by reversing the escaping assumption on ?, +, | and parentheses. Going all the way would have been the better decision, and I'm glad to see the authors of this project agree with me on that.
> "Some people, when confronted with a problem, think ''I know, I'll use regular expressions!''. Now they have two problems."
This is a sort of funny tongue-in-cheek jab, not a factual statement. I'd chuckle if I heard someone say it in real life, but also eye-roll internally. There are some people who will write lines of substr code to trim whitespace from a string. That is the type of person I would expect to crack this joke.
Regular expressions are fine, just like an oil filter wrench is fine. I wouldn't use an oil filter wrench to tighten a nut/bolt. I wouldn't write a single regular expression to validate an email address or implement a lexer.
While I respect the effort put into making this project, I see it mostly as a distraction for new developers. If you are confused by regular expressions, you may be trying to do too much with them and/or you need to put some effort into mastering them (I recommend Jeffrey Friedl's Mastering Regular Expressions).
If you try to pretend regular expressions don't exist, you are going to have a bad time. You will encounter them, unless you never touch existing code during your tenure.
I also don't think this syntax is better. Regular expressions are terse, and that is a feature. It may feel like an anti-feature in larger expressions, but the secret is that if you are writing huge regular expressions, you are probably doing something wrong.
I think the point of that quote is less about error prone regexes and more about regexes not being the right tool for many parsing problems. A new regex language doesn't really tackle that right?
> Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one
Wow, that's unnecessarily inflammatory.
This is the first I've ever heard of Rulex, yet it takes no more than a glance at the homepage to realise the author would have more mastery of regex than 99.99% of developers.
Rulex transpiles to regex, running in a variety of engines. Meaning the author has not just mastered one regex dialect, but the subtleties of all supported regex engines (PCRE, JavaScript, Java, .NET, Python, Ruby and Rust).
> Rulex transpiles to regex, running in a variety of engines. Meaning the author has not just mastered one regex dialect, but the subtleties of all supported regex engines
Not necessarily. More likely, Rulex supports the common subset of features of those engines, with small syntactic differences in the output where necessary, ignoring many subtleties of the implementations.
It's a middle ground, actually: When compiling a rulex, you specify the target regex flavor. Some features (e.g. lookaround and backreferences) aren't supported in every flavor. Some features are polyfilled by rulex, so they work even when the targeted regex engine doesn't. However, there are also features supported by PCRE that aren't available in rulex yet. Notably, atomic groups, recursion, character class intersection/subtraction and some mode modifiers aren't supported yet. Eventually I'd like to support all of PCRE's features.
>Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one.
Yes, because any existing language is the global optimum, and we should always strive to reuse it, never invent any new one...
I mean, what's the problem with caveman-era stone cart wheels? Why reinvent the wheel?
Not to mention the implication that the only reason one would try to invent a new language is to "not invest time mastering an existing one". I.e. "It's not that this guy is trying to improve regular expression languages - it's that he's too dumb/lazy to master standard regular expressions".
Never mind that this new language also transpiles into regular regex, so clearly the person who created it has also "mastered the existing one".
Working with a company that writes hundreds of regexes per day, I can say with confidence: regex syntax is _not_ fine. People with lots of experience make obvious mistakes and lose them in the line noise all the time.
1. Very steep learning curve for beginners, especially if they have to read a regex they didn't write themselves
2. Write-only language, just like APL: dense and concise, but a giant PITA to read once you've walked away from a complex regex for a couple of days.
3. Not standardized (sed regex vs. sed extended regex vs. perl regex vs. vim regex vs. bash regex vs. globbing vs. whatever else, it's never the same, what a giant effing PITA)
4. No proper integrated debugger (there might be some 3rd-party tools out there, but they're never around when you need them, as in: integrated in whichever environment you are currently writing regexes in). When your large and complex regex fails, good effing luck trying to figure out why, short of breaking it down into tiny pieces and debugging each in turn recursively. You don't even have access to a bloody printf.
5. Boolean trees of regexes (match blah OR blech AND NOT blargh, where blah, blech and blargh are themselves boolean trees of regexes) are a giant PITA to express within the language (a sketch of the usual lookahead workaround follows this comment).
6. Overlap and confusion between capturing with '()' and merely grouping an expression to feed into an alternation.
7. No way to factor out common and/or repeating parts, like you would in a normal programming language.
So any attempt at improving this mess is welcome in my book.
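To illustrate point 5 with a rough sketch of my own (Python purely for illustration): the standard workaround is to stack lookaheads, and the boolean structure gets buried in the syntax.

    import re

    # "(blah OR blech) AND NOT blargh" forced into one pattern: the OR stays an
    # alternation inside a lookahead, and the NOT becomes a negative lookahead.
    pattern = re.compile(r"^(?=.*(?:blah|blech))(?!.*blargh).*$", re.S)

    assert pattern.search("xx blech yy")
    assert not pattern.search("blah and blargh")

    # The same logic expressed outside the regex language reads like what it means:
    def matches(text: str) -> bool:
        return ("blah" in text or "blech" in text) and "blargh" not in text

    assert matches("xx blech yy")
    assert not matches("blah and blargh")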
> So any attempt at improving this mess is welcome in my book.
Already done, I think. :)
2. Not a big problem if the responsible programmer employs named captures and named subpatterns, switches on extended mode, and makes use of whitespace and comments. As with any other programming environment, it's a matter of wanting to write readable code, and perhaps enforcing this through technical and social means. (A short sketch of what this looks like follows this list.)
3. I observe that Perl-style regex has become the de facto standard, in the sense that numerous programming languages in common use have adopted it, and preceding styles are marginalised to the point that any discussion of regex that doesn't specify a style can be assumed to mean Perl-style.
The next points are actually implementation dependent. It is not fair to condemn regex as a whole just because some implementations fall short.
4. Perl has good debugging tools for regex.
5. This is easy to express in Perl.
6. Non-capturing grouping exists.
7. Named subpatterns exist, interpolation of variables containing regex values exists.
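For what it's worth, these facilities exist well beyond Perl too. Here is a small sketch of my own (Python's re module, not anything from the thread) combining named captures, extended/verbose mode, and fragments factored out into ordinary variables, which covers points 2, 6 and 7:

    import re

    # Reusable fragments are just ordinary variables...
    YEAR  = r"(?P<year>\d{4})"
    MONTH = r"(?P<month>0[1-9]|1[0-2])"
    DAY   = r"(?P<day>0[1-9]|[12]\d|3[01])"

    # ...composed into a larger pattern, written in verbose mode so it can be
    # laid out and commented like normal code.
    ISO_DATE = re.compile(
        rf"""
        ^
        {YEAR} - {MONTH} - {DAY}   # YYYY-MM-DD
        $
        """,
        re.VERBOSE,
    )

    m = ISO_DATE.match("2022-06-19")
    assert m and m.group("month") == "06"

    # Non-capturing groups (?: ... ) keep grouping separate from capturing.
    assert re.fullmatch(r"(?:foo|bar)+", "foobarfoo")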
Perl is a dying language. Your 4-5 "solutions" are non-solutions in practice for 99% of software developers out there.
And regexps are by default written in the concise form. It's the standard problem of "opt-in" vs "opt-out".
Opt-in, as a rule of thumb, has low adoption rates, for various reasons (maybe 20% if you get lucky).
Opt-out is the inverse: it has adoption rates of 90%+, except for total, universally known disasters.
Tools which are unsafe by default are bad.
Yes, the craftsmen are bad, but the way we improve is by also improving our tools. It doesn't matter how much pointing at the idiots we do: practical regexps have been around since ~1975, almost 50 years, and in industrial use they haven't gotten safer.
I'd agree that we get used to regex with usage, but need to be extra careful to test all the corner cases (since by design, a pattern matches multiple things).
Creating rulex has been quite a journey, during which I learned a lot about regex syntax, its quirks, and the differences between various regex flavors. I already knew a lot about regexes before I started this project, but creating a language that compiles to regexes requires a much deeper and more thorough understanding of them. I didn't create rulex so I don't have to master regexes. I created it so that people who use it don't have to master regexes -- a language that is riddled with poor design choices. Yes, you can get used to it, just like you can get used to working with a dull plastic knife. The fact that people need a website such as regex101 or regexr just shows, in my humble opinion, that regular expressions aren't as straightforward or easy to understand as they could be.
I want to allow users to create and share regexes on my website, but I'm afraid there could be some abuse, like building a very complex one that would get JavaScript stuck or crash. Are there any known vulnerabilities, and is there any way to sanitize regexes?
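The usual concern here is "ReDoS" (regular expression denial of service) via catastrophic backtracking: JavaScript's built-in engine is a backtracker, so a nested-quantifier pattern matched against input that almost matches can take exponential time. A rough sketch of the shape of the problem, shown in Python (whose re module backtracks the same way; sizes and timings are illustrative only):

    import re
    import time

    # Nested quantifiers plus a required trailing character that never appears:
    # the engine tries exponentially many ways to split the run of a's.
    evil = re.compile(r"^(a+)+$")
    subject = "a" * 22 + "!"

    start = time.perf_counter()
    evil.match(subject)            # fails, but only after heavy backtracking
    print(f"no match, took {time.perf_counter() - start:.2f}s")
    # Each extra 'a' roughly doubles the time.

There's no reliable way to "sanitize" arbitrary patterns by inspecting them; the usual mitigations are to cap pattern and input length, run matching off the main thread with a timeout, or use a linear-time engine such as RE2.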
Regexes often contain complex parts that repeat themselves throughout the regex or can be combined into higher level patterns. This adds support for that.
Looks like it would be good if you had to use some ancient regex engine.
Otherwise, I find regex is extremely easy, since the problems it solves well are very simple, and one can develop it interactively by trial and error, by looking at the highlights in their editor.
I could possibly see actually using this if the playground had a similar "Try it on real text instantly" feature, but I've never had a problem with traditional regex+trial and error.
you may be resigned to trial and error being an essential part of regex development, but the reliance on trial and error should be seen as (1) a signal that something can be improved in how programmers write regexes, and (2) a symptom of a programming ethos in which essential things are expected to be difficult.
I'm glad that this project (and others like it that have been highlighted here) allows us to improve on (1), and I am always encouraging my professional peers in computing to reconsider (2).