This is effectively just a nicer syntax for regular expressions than the usual one, but there are real areas of semantics where regexp languages are ripe for improvement - most regexp libraries are stuck in the 80s/90s, not keeping up with recent developments in research.
For example, most regexp languages basically only have the traditional three operators of alternation, concatenation and closure/iteration. It's not necessary to stop there: the regular languages are closed under various other useful operators. Many of these operators don't even significantly increase the size of the minimal DFA.
For example, the complement (as in set/language complement) and reversal/mirror image would be basic additions that seem like they should be standard by now. Using them can make a regular expression much shorter, nicer and more understandable. For the complement, this should come at no cost in minimal DFA size.
Some other viable operators are: intersection, set difference, merge (AKA shuffle), infiltration (also sometimes known as shuffle), interleaving.
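To give a concrete (and hypothetical - no mainstream engine exposes this syntax) sense of the shortening: with a complement operator, "any string not containing foo" could be written roughly as

    !( .* 'foo' .* )

whereas in today's PCRE-style syntax the usual workaround is the tempered-dot idiom

    ^(?:(?!foo).)*$

which is both longer and harder to read.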
Some other possibilities for regexps that could be put to better use are:
* weighted regular expressions: these add power in a very elegant way; the idea is that programmers don't want just recognizers, so why limit regexps or finite automata to that.
* JIT. It should be possible to compile regexps with libgccjit or with llvm.
> most regexp libraries are stuck in the 80s/90s, not keeping up with recent developments in research.
Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
For the small subset of regex engines that are based on finite state machines, it's pretty much intractable there too outside of niche regex engines.
In fact, the research[1] suggests that adding things like complement/intersection is quite difficult:
> In particular, we show that when constructing a regular expression defining the complement of a given regular expression, a double exponential size increase cannot be avoided. Similarly, when constructing a regular expression defining the intersection of a fixed and an arbitrary number of regular expressions, an exponential and double exponential size increase, respectively, cannot be avoided.
And indeed, as another commenter pointed out, "minimal DFA" is effectively irrelevant for any general purpose regex engine. Not only do you not have the budget to build a DFA, but you certainly don't have the budget to minimize that DFA.
With respect to reversal, RE2 and Rust's regex crate both do that. But mostly as an internal strategy for finding the start-of-match when using a lazy DFA. It's otherwise a somewhat niche feature more generally.
With respect to JITs, plenty of regex engines out there do that. PCRE comes to mind. So does V8.
As a general purpose regex engine author, we aren't "stuck" in the 80s/90s. There are just some fundamental trade offs at play here that make your ideas difficult to support. It's not like we haven't thought about it. Moreover, for things like complement and intersection specifically, actually reasoning about them in regex syntax is pretty tricky! I'm not sure if you've tried it or not. (There are some niche regex engines that implement it, like redgrep.)
>Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
>Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
Most regex engines, using backtracking, already have it in the form of lookarounds. The performance penalty of such a strategy is often quite significant. Lookarounds are hard to automatically convert into intersection and complement, but if such primitives were exposed to the programmer, they could often reimplement their regexes in terms of them. It might occasionally lead to an unacceptable performance cost, but there would be many cases where it would be a desirable tradeoff.
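As a concrete illustration (not from the comment above, just the common idiom): in a backtracking engine, intersection is typically simulated with lookahead, e.g.

    ^(?=.*[0-9])(?=.*[a-z]).*$

matches strings containing both a digit and a lowercase letter. Each lookahead rescans the input from the current position, though, which is where the performance penalty mentioned above comes from.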
> Can you show me the research that states how to add things like complement and intersection to general purpose regex libraries?
Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff.
> Most regex engines are backtracking based, and in that context, adding complement/intersection seems pretty intractable to me.
Yeah, I wasn't really considering irregular "regexps"; they don't make much sense in theory, so of course extending them wouldn't make much sense either. Thing is, extending true regular expressions with operators like complement and concepts like weights seems like it could make the "theoretically pure" regexps more powerful *in practice* than irregular regexps (with backreferences, etc.), and certainly more understandable.
> For the small subset of regex engines that are based on finite state machines, it's pretty much intractable there too outside of niche regex engines.
Don't think so. There are many examples, but mostly implemented in "functional" programming languages.
> In fact, the research[1] suggests that adding things like complement/intersection is quite difficult:
> > ...
You misunderstood the abstract. It's not saying that it's difficult to translate extended RE (RE with these additional operators) to finite automata; it's just saying that translating extended RE to traditional RE can cause a huge blowup (something I already hinted at in the comment above). So this is a pro, not a con.
> And indeed, as another commenter pointed out, "minimal DFA" is effectively irrelevant for any general purpose regex engine. Not only do you not have the the budget to build a DFA, but you certainly don't have the budget to minimize that DFA.
It's possible to produce NFA directly from extended RE, see "Antimirov derivatives" for a start. The original Antimirov derivatives don't support complementation and similar, but there are extensions that do. Search for something like "partial derivatives regular expression complement", or "derived term automata complement".
> niche regex engines that implement it, like redgrep
> Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff.
Obviously, I have. There's a reason why there isn't a single general purpose regex engine that uses regex derivatives. They build full DFAs and full DFAs take worst case exponential time and space.
> Don't think so. There are many examples, but mostly implemented in "functional" programming languages.
Oh I think so. Feel free to link to some examples of general purpose regex engines that implement these things.
> You misunderstood the abstract. It's not saying that its difficult to translate extended RE (RE with these additional operators) to finite automata, it's just saying that translating extended RE to traditional RE can cause a huge blowup (something I already hinted at in the comment above). So this is a pro, not a con.
No, I'm not misunderstanding anything. The bottom line is that if you have an NFA with N states and you take its complement (how do you do that without first converting it to a DFA?), then you could wind up with an NFA containing a number of states exponential in N.
What's happening here is that you aren't quite appreciating what it means to engineer a general purpose regex engine.
All of these things you're talking about are all doable and implementable in niche regex engine libraries that serve a particular purpose. And it's all been done before. What I'm talking about here are general purpose regex engines, where you can't afford quadratic time during regex compilation (which is why Thompson's construction is so popular), nevermind exponential time.
And all of this is rooted in me taking objection to your characterization of regex engines being "stuck" in the 80s/90s. I'm trying to explain to you why that's not the case.
In theory, practice and theory are the same. In practice, they're totally different.
> It's possible to produce NFA directly from extended RE, see "Antimirov derivatives" for a start.
I know. I'm not talking about what's "possible." I'm talking about what's feasible from an engineering perspective. As far as I know, building such an NFA takes O(n^2) time, which is a hard sell in a general purpose regex engine.
There are tradeoffs, for sure; but say one creates an API with separate procedures for compilation and execution, I think one could call such an engine a "general purpose one" even with the compilation taking quadratic (or even exponential) time.
> The bottom line is that if you have an NFA with N states and you take its complement (how do you do that without first converting it to a DFA?), then you could wind up with an NFA containing a number of states exponential in N.
Are you sure that a less naive approach doesn't exist? Genuinely asking, I'm not an expert. I don't quite remember the contents, but I know there are some papers with extensions to Antimirov derivatives (see updated upthread comment) that support taking the complement of an expression.
> There are tradeoffs, for sure; but say one creates an API with separate procedures for compilation and execution, I think one could call such an engine a "general purpose one" even with the compilation taking quadratic (or even exponential) time.
I don't think so. There's a reason why exactly zero popular regex engines build full DFAs. (Sometimes small DFAs are built in cases where you know they'll be very small or enforce a tight bound, but even then, the DFA is typically used as a prefilter rather than a regex engine unto itself.)
There are plenty of regex engines that do build full DFAs, but they're either relegated to the work of research or niche areas. re2c, for example, is an excellent regex engine that translates regex syntax into DFA source code. It thus has procedures for compilation and execution, but it is not what I would call a "general purpose" regex engine.
As another example, if I were to make Rust's regex crate use quadratic (or far worse, exponential) compilation, I'm pretty sure I'd have many upset users. (Now, I could use quadratic or even exponential compilation in cases where I'm sure that the absolute wall clock time won't be disagreeable, but I think it's safe to exclude such things for the purposes of this discussion.)
Even for DFAs that don't take exponential time, Unicode means that DFAs get obscenely big. \pL, for example, is 279 states and uses 143K of memory. And it takes around 8ms to compile on my machine. 1ms is already pushing it.
> Are you sure that a less naive approach doesn't exist?
Pretty sure. I'm not an expert in the theory either. I'm just aware of most of it. And I'm totally unaware of anyone who has overcome the practical engineering challenges of things like complement and intersection.
One of the papers that I was referring to above is "Derived-term automata for extended weighted rational expressions" by Akim Demaille, from 2016. I haven't managed to study the topic more closely yet, but the idea seems to be to compute the DFA lazily, only creating the states and transitions when they are required. No idea how well this can perform in practice, but this seems like a valid way to avoid the exponential blowup, and it's already implemented in the Vcsn (formerly "Vaucanson", I think) software.
Yes, that's what RE2, the regex crate and a few other regex engines already do. (Specifically, build the DFA lazily.) Everything I've said I've said with that full context.
I don't know, maybe just don't make grand pronouncements about regex engines if you aren't actually familiar with how they're implemented and the engineering tradeoffs in play?
It seems pretty clear from reading the above thread that burntsushi is saying that they have not been able to find a way to implement the extensions you suggested without making the compile time unfeasible for general use. Given how prolific their work on regex and other text-search related libraries is, it seems reasonable to take them at their word that they have in fact spent time trying to solve this problem. On the other hand, it doesn't sound like there are any practical implementations that anyone can point to that add these extensions in a way that keeps compile time feasible. Given that, you seem to be making the stronger claim, so it's a little strange to put the burden of proof on the other side of the argument.
I already told you. I'm done wasting my time speaking with you in this thread. In particular, you seem unwilling to acknowledge any specific trade offs that I've pointed out.
Wait, if one wants to match a complement of a regex, surely it is equivalent to seeing if matching the original regex would fail? Or do I miss something obvious?
I mean, yes? But it's kind of missing the forest for the trees. Regex searching isn't limited to "did it match or not," but also to finding where the regex matched in the first place. For example, let's say you want to match all words except for 'foo'. You might write, '\b(!foo)\b', where the '(!re)' syntax means "complement of 're'." That would then match 'bar', 'baz' but not 'foo' in 'bar foo baz'.
It's like how character class set notation has a 'negation' feature. e.g., '[a-z]' matches a-z, but '[^a-z]' matches anything except for 'a-z'. So you might say, "why bother with adding negation when you can just check if the original char class would fail?" The answer is that without negation embedded in the regex itself, you miss out on composition.
This is another reason why features like complement and intersection aren't particularly popular. They are weird to reason about. Another reason is that many of their use cases can be achieved through other means. For example, my example above could conceivably be done by iterating over all words (using word segmentation, for example) and just checking whether each matches 'foo' or not. It could also likely be accomplished through the use of negative look-around, commonly found in backtracking regex engines and not finite automata based regex engines.
Your proposed implementation makes sense for the case of merely taking the complement of a simple regular expression. But the point of having the complement operator available is to be able to take complements of arbitrary sub-expressions within the RE, and even having nested complements in the RE. In that case your proposed implementation would be relatively inefficient.
I downvoted it because you were talking to burntsushi, who has implemented an excellent general-purpose regex engine and is very thorough, thoughtful, and knowledgeable about it. When he said "Can you show me the research", that was him being exceedingly generous to your point of view. And when you said "Sorry, don't have time for that right now, but look into regular expression derivatives and similar stuff", that was you being the opposite of generous.
Thanks for voicing your opinion. That said, I don't see constructive criticism here. From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
> who has implemented an excellent ...
This is an appeal to authority.
> being exceedingly generous to your point of view
Now you're just insulting me!?
> you being the opposite of generous
Consider that I couldn't have known upfront that burntsushi is acquainted to any degree with RE derivatives and similar concepts. His regexp engine is not based on them AFAIK. So I pointed to those for a start. Later in the discussion, including in the same comment, I pointed to other sources, too. So I think it's not fair to call me ungenerous.
The whole "general-purpose regex engine" thing was moving the goalposts in the first place and not very relevant to this whole discussion anyway. EDIT: I'm not saying that these distinctions don't matter at all, rather what I want to say is that it doesn't make sense to implicitly call upon a subjective and arbitrary standard.
> Thanks for voicing your opinion. That said, I don't see constructive criticism here. From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
I don't think it's so much "what you said" or even "who you said it to" but "how you said it". Most of us are coming to this thread specifically because we _don't_ already have specialized knowledge in this area, and we were hoping to learn something. It's totally reasonable not to have time to explain something this complicated, but it comes across as a bit disingenuous when followed by many additional comments in the thread continuing to debate in what looks to be close to real time. I think in this case, people are downvoting more for a breach of "decorum" than for anything else; you might very well be correct in your assertions about regexes, but it's impossible for any non-experts to tell, so we end up seeing (fairly or not) a picture of someone who is more interested in spending time arguing that they're right than actually demonstrating it.
> From what I gather I was basically at fault for daring to somewhat disagree with burntsushi.
Not so much for disagreeing. More for engaging in a lower-effort way (than the person you were talking to) and appearing to assume that burntsushi was coming from a place of ignorance.
> The whole "general-purpose regex engine" thing was moving the goalposts in the first place and not very relevant to this whole discussion anyway. EDIT: I'm not saying that these distinctions don't matter at all, rather what I want to say is that it doesn't make sense to implicitly call upon a subjective and arbitrary standard.
"general-purpose regex engine" is a direct response to "most regexp libraries are stuck in the 80s/90s". It's about why the regex libraries most people use (the general-purpose ones) haven't adopted the ideas that you're advocating. Seems on-point to me; on HN most readers will be mainly familiar with the general-purpose libraries, so they'll be thinking in terms of those.
I understand how you might think this. However, please consider that the statement you were responding to is establishing bona fides - a reason why burntsushi's words might be worth consideration.
Sometimes an appeal to authority is not wrong.
In my experience, the more I study a topic, the smarter other folks appear.
> * JIT. It should be possible to compile regexps with libgccjit or with llvm.
Although not JIT in that it needs the regex specified at compile time, both Rust and C++ (using the compile time reflex library) can generate the code to evaluate the regex at compile time.
The regex crate (for Rust) does not have this capability. It used to many years ago via the regex_macros crate, but I abandoned it mostly because of not having enough resources to dedicate to it and its somewhat niche usage. If you need compile time regexes, it's better to use something like re2c.
The reflex library is interesting, but more like a C++ template hack. Its codegen is impressive, but it's missing (last time I checked) a lot of the classical optimizations that a regex engine typically has. Like a literal prefilter scan.
> It used to many years ago via the regex_macros crate, but I abandoned it mostly because of not having enough resources to dedicate to it and its somewhat niche usage.
As an aside, my impression is that Rust developers are using macros less now than they were even a couple of years ago. It seemed a few years ago, a lot of people coming to Rust were impressed by the power of macros (especially procedural) and used them in a lot of libraries. It seems now, people are realizing the costs of macros to tooling and understandability, and are using them more judiciously.
Note that "minimal DFA" is not very relevant these days, with Google's `re2` being the only implementation I'm aware of that implements a DFA. All others sacrifice predictable performance for feature parity with Perl.
It's certainly not the only one. Rust's regex crate (of which I'm the author) and Go's regexp package both use finite automata.
It should be noted that none of them build out complete DFAs in the general case. Building a DFA is worst case exponential time and space, and they can in practice be expensive even when not hitting the worst case.
Instead, they use different techniques. RE2 and the regex crate, for example, use a hybrid NFA/DFA that does subset construction at search time. Outside of pathological cases, it provides the speed of a DFA without needing to build the whole thing up front.
Hyperscan is also built on finite automata. GNU grep uses finite automata in many cases too.
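For readers unfamiliar with the hybrid approach described above, here is a minimal sketch of the idea in Rust. This is an illustration of on-the-fly subset construction only, not RE2's or the regex crate's actual code: it uses a toy NFA without epsilon transitions, and keys the cache on whole state sets where a real engine would use compact state IDs.

    use std::collections::{BTreeSet, HashMap};

    // Toy byte-labelled NFA: transitions[state] maps a byte to its successor states.
    struct Nfa {
        transitions: Vec<HashMap<u8, Vec<usize>>>,
        accepting: BTreeSet<usize>,
        start: usize,
    }

    // Anchored "does the whole haystack match" test. DFA states (sets of NFA
    // states) and their transitions are built lazily and cached, so hot paths
    // run at DFA speed without ever constructing the full DFA up front.
    fn lazy_dfa_is_match(nfa: &Nfa, haystack: &[u8]) -> bool {
        let mut cache: HashMap<(BTreeSet<usize>, u8), BTreeSet<usize>> = HashMap::new();
        let mut current: BTreeSet<usize> = BTreeSet::from([nfa.start]);
        for &b in haystack {
            current = cache
                .entry((current.clone(), b))
                .or_insert_with_key(|(set, byte)| {
                    // One step of subset construction, done only on first use.
                    set.iter()
                        .flat_map(|&s| nfa.transitions[s].get(byte).into_iter().flatten())
                        .copied()
                        .collect()
                })
                .clone();
        }
        current.iter().any(|s| nfa.accepting.contains(s))
    }

    fn main() {
        // NFA for the literal "ab": 0 --a--> 1 --b--> 2 (accepting).
        let nfa = Nfa {
            transitions: vec![
                HashMap::from([(b'a', vec![1])]),
                HashMap::from([(b'b', vec![2])]),
                HashMap::new(),
            ],
            accepting: BTreeSet::from([2]),
            start: 0,
        };
        assert!(lazy_dfa_is_match(&nfa, b"ab"));
        assert!(!lazy_dfa_is_match(&nfa, b"aa"));
    }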
Not a single other implementation has feature-parity with Perl. Perl regexes are not only Turing-complete, but can also execute arbitrary code¹ as part of the regex evaluation.
PCRE is the pinnacle of deceptive advertising. It is emphatically not Perl-Compatible.
Source: I used to write Perl for a living, and I have battle scars from (¹) above.
When Sun released Java and went on a marketing road show, I drove to Chicago to hear the presentation. Since Perl was my favorite language, I had to ask about regular expression support; the Sun engineer said there wasn’t any, but I could build the first library.
To me, it was inconceivable that you could release a major language and not only not have regular expressions built into the language like Perl, but not even have it in the standard library. Crazy.
When I found out that Jonathan Locke had even a minimal Java implementation of a regex engine, I jumped on it and messaged him to open source it to the Jakarta project. Crazy that that was 20 years ago...
We did the same thing for Rust. Regexes are neither in the language nor in the standard library. Go doesn't have them in the language either (although they're in the standard library).
Sorry, early morning. I should have been more verbose.
Rust has a different target problem space, a different standard library philosophy, a package management system, and exists in a world where adding external dependencies is trivial.
It can, but this isn’t a hard timeout. From the description (“The MatchTimeout property defines the approximate maximum time interval for a Regex instance to execute”), it seems they have their regex engine periodically (say after handling every 10,000 instructions of the compiled regex) check whether they went over the limit.
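A minimal sketch of my reading of that description (the type and numbers here are made up; this is not .NET's actual code): the engine counts simulated instructions and only consults the clock every N of them, so the limit is approximate rather than hard.

    use std::time::{Duration, Instant};

    // Approximate deadline: the clock is checked only every CHECK_EVERY ticks.
    struct ApproxDeadline {
        start: Instant,
        limit: Duration,
        ticks: u64,
    }

    impl ApproxDeadline {
        const CHECK_EVERY: u64 = 10_000;

        fn new(limit: Duration) -> Self {
            ApproxDeadline { start: Instant::now(), limit, ticks: 0 }
        }

        // Call once per executed regex instruction.
        fn tick(&mut self) -> Result<(), &'static str> {
            self.ticks += 1;
            if self.ticks % Self::CHECK_EVERY == 0 && self.start.elapsed() > self.limit {
                return Err("regex match timed out");
            }
            Ok(())
        }
    }

    fn main() {
        let mut deadline = ApproxDeadline::new(Duration::from_millis(1));
        // Stand-in for the regex engine's instruction loop.
        loop {
            if deadline.tick().is_err() {
                println!("gave up after ~{} instructions", deadline.ticks);
                break;
            }
        }
    }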
There’s also the postgres regex engine, I don’t know how it works under the hood but it’s very resistant to catastrophic backtracking despite features which normally translate to nfa (e.g. backrefs).
> features which normally translate to nfa (e.g. backrefs)
You're very wrong: backreferences in general can't be translated to finite automata (NFA is short for nondeterministic finite automaton). See these comments:
On a related note, if you have Python regex code that you want to make more stable/performant, https://pypi.org/project/pyre2/ is a drop-in replacement for `re` that (configurably) falls back to `re` if you use lookaheads, etc.
Various regexp implementation possibilities do exist, whether in software or in hardware. I think my general point stands independently of which regexp implementation is used (although the comparisons get hairy if you include the irregular expression "regexps").
Yes, and straightforwardly so if you use character classes as your basic building blocks. Here's a Haskell implementation I wrote that is easily extendable to include complements: https://github.com/dan-blank/hgrep-smallcore (I like this project because it translates ERE-compliant regexes - sans negated character sets - down to only 4 constructs, one of which is character classes). It implements https://www.ccs.neu.edu/home/turon/re-deriv.pdf; character classes are described in 4.2.
I actually had complement in it as a 5th construct, but when the submission came closer and the examiners found some errors in my logic (my fault for not writing good enough unit tests!), I took complement out again when cleaning the project up.
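For readers who don't want to dig through the paper: here's a minimal Brzozowski-derivative matcher sketched in Rust (not the Haskell project above, and without the character classes, smart constructors and memoization a real implementation needs), mainly to show how cheaply complement slots in.

    // A tiny regex AST. `Not` is language complement.
    #[derive(Clone)]
    enum Re {
        Empty,                 // matches no string
        Eps,                   // matches only the empty string
        Chr(char),
        Alt(Box<Re>, Box<Re>),
        Cat(Box<Re>, Box<Re>),
        Star(Box<Re>),
        Not(Box<Re>),
    }

    // Does the language of `r` contain the empty string?
    fn nullable(r: &Re) -> bool {
        match r {
            Re::Empty | Re::Chr(_) => false,
            Re::Eps | Re::Star(_) => true,
            Re::Alt(a, b) => nullable(a) || nullable(b),
            Re::Cat(a, b) => nullable(a) && nullable(b),
            Re::Not(a) => !nullable(a),
        }
    }

    // Brzozowski derivative of `r` with respect to `c`: the regex matching the
    // suffixes that complete a match of `r` after reading `c`. Note that the
    // complement case is a one-liner.
    fn deriv(r: &Re, c: char) -> Re {
        match r {
            Re::Empty | Re::Eps => Re::Empty,
            Re::Chr(d) => if *d == c { Re::Eps } else { Re::Empty },
            Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
            Re::Cat(a, b) => {
                let step_a = Re::Cat(Box::new(deriv(a, c)), b.clone());
                if nullable(a) {
                    Re::Alt(Box::new(step_a), Box::new(deriv(b, c)))
                } else {
                    step_a
                }
            }
            Re::Star(a) => Re::Cat(Box::new(deriv(a, c)), Box::new(Re::Star(a.clone()))),
            Re::Not(a) => Re::Not(Box::new(deriv(a, c))),
        }
    }

    fn matches(r: &Re, input: &str) -> bool {
        nullable(&input.chars().fold(r.clone(), |r, c| deriv(&r, c)))
    }

    fn main() {
        // !(ab): everything except the exact string "ab".
        let ab = Re::Cat(Box::new(Re::Chr('a')), Box::new(Re::Chr('b')));
        let not_ab = Re::Not(Box::new(ab));
        assert!(matches(&not_ab, "aa"));
        assert!(!matches(&not_ab, "ab"));
    }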
I tried to test your program because I'm pretty sure your techniques can't be used in a general purpose regex engine. (I've long wanted to make use of regex derivatives somehow, but I don't think it's feasible because of the downsides of building up a full DFA.) More to the point, I also suspected that you might be using a sparse representation for transitions in your DFA if the alphabet of your DFA is indeed all of Unicode. This is also problematic because it tends to make search time quite a bit slower.
In any case, I built your program with 'stack install' but got this when I tried to use it with Unicode:
I think you should probably cut out the ERE thing entirely. Or maybe say it's "inspired" by ERE. Otherwise, it's not really that close at all. Notice that not even 'hgrep-exe '[a-z]' /tmp/b' worked for me either. That's certainly in EREs. So are things like '[[:word:]]', which also doesn't seem to work. There's also collating symbols and equivalence classes and some other junk.
But upon looking at the POSIX ERE spec, yes, it looks like technically things like \w, \d and \s are not in it. But most ERE implementations, including both BSD grep and GNU grep, support constructs like \w. (Yet another reason to cast a suspicious eye on folks who obsess about portability. Portability means following a spec, not using whatever your tool lets you do. And most tools let you do far more than what's in the spec because the spec---especially one like POSIX---is usually divorced from the reality of what's useful.)
I understand your tool is a university project. The main point I'm trying to drive home here is that there are folks in this thread that seem not to be keen on acknowledging engineering challenges and are instead only looking at the theory. Unicode, for example, is an enormous engineering challenge. It's not difficult because getting it correct is difficult, it's difficult because making it correct and fast is not straight-forward. As you show, using a naive representation (sparse transitions, hash sets for states) will give you correctness without much complexity. But that's not usually what we mean when we're talking about supporting Unicode in general purpose regex engines. Because "general purpose" means folks expect it to be minimally fast.
Ah! I was only referring to ERE as specced by POSIX, yes.
> It's not difficult because getting it correct is difficult, it's difficult because making it correct and fast is not straight-forward.
It seems to me that you take my project as a proxy for whether regex derivatives are a feasible way to deal with Unicode-ready regexes that also support complements and so on. That was not my intention; I was merely attempting to show an easy implementation of regex derivatives that can deal with Unicode and can be extended to support complements. With this project and the paper I linked, it should be rather straightforward to answer whether this particular way of constructing regex DFAs can achieve what you are looking for.
(To my last knowledge, a regex complement is not easy to add in the presence of extra features like backtracking and lookahead.)
> With this project and the paper I linked, it should be rather straightforward to answer whether this particular way of constructing regex DFAs can achieve what you are looking for.
Yes, you're correct. It just isn't that interesting because it will fall over in any kind of real practical usage. :-) Full disclosure, I'm the author of ripgrep, so I have some particularly relevant experience in the specific domain of making regexes work well for the masses. (It also forms my bias with what I care about.)
Like I said, I wasn't necessarily trying to pick on you. There are others in this thread that are seemingly ignoring the engineering challenges, and only looking at the theory. Then using that as a basis to proclaim that regex engines are "stuck" in the 80s/90s.
Consider my perspective here: if someone pipes up and says, "yeah hey it is actually easy to add those kinds of features to regex engines." Well, then, it's important to include the caveats with that. That's what I see as my role in this thread. Otherwise, it looks like general purpose regex engines are crippled for no apparent reason.
Oh well, first off, thank you for writing ripgrep! Great work, love that tool and use it daily :)
All good, thank you for sharing your POV! I find both the theory and the practical engineering feats quite interesting, and actually understood the original question as a theoretical question, not a question about general-purpose regex engines. Quite possibly that's even how the person asking intended it, in hindsight.
Last thing on topic (talking now about having complement and Unicode in a usable regex engine): my personal, gut-feeling-based hunch is that by adding all these features not strictly related to regular languages, we have kind of evolved our regex engines, and our expectations towards them, into a corner of the design space that works well, but might not work well once one adds complement. It feels like a constructivist world without negation, with stuff on top that works, and reverse-engineering what works with negation and what doesn't will likely be hard work. As a theory-ish kind of person (without much insight into regexes or formal grammars), I wish there were some kind of insight into what the atoms of a regex are and which atoms can be combined with which other atoms - I would bet that fancy features and Unicode+complement would have a high algorithmic complexity already in theory.
Any final tips so the repo I linked doesn't disappoint users' expectations? :)
I think you got it right by putting "university project" in the description. :-) I wrote my fair share of projects like that too.
The incentive structure of academia that leads to software like that, though, is also a big reason why I left. The theory is super important (that's why I work on a regex engine that only provides support for regular languages), but I also think the engineering aspect is just as important. And academia, or at least the corner I was in, just did not care about the engineering side of things at all. This in turn leads to publishing results that are hard to reproduce, because if you don't share code that is at least somewhat robust, it's going to be very costly for someone else to build on your work.
Anyway, sorry about the side rant haha. But all good.
All good! And sorry for not making the effort to understand your position first, I want to work on that.
> The incentive structure of academia that leads to software like that, though, is also a big reason why I left. The theory is super important (that's why I work on a regex engine that only provides support for regular languages), but I also think the engineering aspect is just as important. And academia, or at least the corner I was in, just did not care about the engineering side of things at all. This in turn leads to publishing results that are hard to reproduce, because if you don't share code that is at least somewhat robust, it's going to be very costly for someone else to build on your work.
Totally get it. This also baffles me - papers in computer science that are hard to reproduce? Yes I know, publish or perish and all that BS. But I don't want to accept this: I mean I get it for psychology obviously, and then also material sciences and so on, but for compsci?... Next to proofs, programs are probably one of the most reproducible things that nature has given us.
Good for me and many others that you cho(o)se to spend your time on building tools :)
> Full disclosure, I'm the author of ripgrep, so I have some particularly relevant experience in the specific domain of making regexes work well for the masses.
> it looks like technically things like \w, \d and \s are not in [POSIX]
The relevant standard is UTS #18; it subsumes POSIX afaict. Do you think the same as me about it, namely that following and implementing it is essential?
We were talking about EREs, which are an artifact of POSIX, not UTS#18. So the relevant standard for this specific conversation is POSIX.
To redirect to UTS#18, I don't think UTS#18 subsumes POSIX. UTS#18 doesn't support [[=a=]] for example AFAIK. And UTS#18 more generally doesn't require locale support. UTS#18 Level 3 was actually removed from the spec, which is where "custom tailored" logic for specific locales used to live. On top of that, POSIX also specifies BREs which UTS#18 doesn't touch. So while there is overlap between POSIX and UTS#18, POSIX is not a strict subset of UTS#18. If you're speaking "conceptually" and less precisely, you can maybe say POSIX is subsumed by UTS#18 though. I don't really think about it that way though personally. They serve two different use cases that are still relevant today.
Mostly irrelevant, but those runtime errors are the main reason why I stopped using Haskell. Haskell is a great language with a great booming ecosystem, but if you can have uncaught exceptions in IO monads, and even unchecked errors in your pure code like that fromJust there, then what's the point? For all its ceremony, it's not much more functionally pure than JavaScript. And a language without purity as a concept, like Rust, has frankly achieved a tighter feeling of safety and great flexibility despite having a simpler type system. IMO Haskell has to drastically change like it did in the early 2000s to become competitive again.
I'm unconvinced by such things. From what I can tell, this is academia code. In academia code, regardless of programming language, folks often don't care about failure modes. (I speak from experience, as someone who used to be in academia.) So I'd be more inclined to pin it on academia's incentive structure than anything about programming languages.
I suppose you could use a programming environment that forbids partial functions, but I'm unclear on how productive that is.
> And a language without purity as a concept, like Rust, has frankly achieved a tighter feeling of safety and great flexibility despite having a simpler type system.
Haskell has 'fromJust' and Rust has 'unwrap'. They're both exactly equivalent and result in similarly poor failure modes. (Well, 'unwrap' usually at least gives you a line number.)
Yeah, but even better, Rust has `expect`; that's beside the point, though. Rust doesn't pretend to be a pure language, it just promises to be a safe language as long as you don't use `unsafe`. Haskell makes a big fuss about how it's pure and productive because you have fewer bugs. And this is true as long as you're writing pure data processing or algorithmic code. But as soon as you stray from that and try to build a practical program, you'll need the IO monad, and it sucks; even Java deals better with exceptions than Haskell. All the type class expressiveness in the world, and they couldn't figure out how to model exceptions. And it used to be that Haskell could just introduce backwards-incompatible changes because it was a research language. But now Haskell is used widely in production, and even though it's in desperate need of a new Prelude, it's not gonna happen.
When someone who truly believed that Haskell could be a proper modern programming tool invested their hard work and frankly their genius into it, the Haskell leadership responded with skepticism and inaction. I bet if you gave Snoyman full dictatorship over a new Haskell prelude, it would be the best general purpose programming language by a mile for the next decade at least. Instead Haskell will remain as backwards as its ISO 8859-1 String type.
Yeah I think you're going places I don't care to follow. We were talking about a specific program, and I made the argument that the behavior of the program was more sharply determined by the incentive structure in which it was written. But you're kind of doubling down into a more general rant about Haskell that is actually totally divorced from the failure mode of the Haskell program in question.
I used to use Haskell too and I stopped using it for $reasons. Nothing to do with the prelude or its string type though. More about the nature of the language and paradigm itself.
Hold on! See: https://github.com/dan-blank/hgrep-smallcore/issues/1 Using `fromJust` is deeply frowned upon, and indeed a few eyebrows were raised when showing my code to the examiners. This and a handful of other warts are unfortunate, but few and easy to work around.
This function and others are in the prelude, but one can use other preludes that don't have these escape hatches.
This does not undermine the huge benefits that people get from having IO checked by the typesystem. Just as having tyepcasting in a language does not undermine the benefits of types in that language.
Nah, the type system is perfectly fine in this case: "Given (Just someValue), return someValue". Exactly what happened here; the precondition just was not fulfilled, and I left the case dealing with the unfulfilled precondition undefined. If one cares about such things, one should simply use one of the numerous alternative Preludes out there (to make sure to actually never use fromJust, head and the 3 other functions that nobody uses from the standard Prelude) and turn the warnings for incomplete pattern matches on. At least, that's how you could deal with it in real life.
If it is not practical matters that are the concern here, there are more than enough languages that only allow total functions and offer other sweet stuff. Haskell's main purpose was to be a lazy-by-default language, and that people can actually write practical stuff in it is a nice side effect.
Of course, Unicode is just another alphabet. Specific expressions might be problematic though, depending on the implementation.
EDIT: If malicious inputs are something to be worried about, I think a good way to do this would be to check for exceedingly costly blowups of various kinds during/after translation to a finite automaton (so before automaton execution). This would enable having complemented expressions when safe (in lots of cases, at least).
To get something useful and not attackable, the complement operation would need to take the complement inside the set of valid UTF-8 strings rather than the set of all byte-sequences, which is probably doable but I'm not sure how straightforward it would be.
I'm curious whether this concretely improves regular expressions for the average programmer or not. Overall, to me it seems like the classic minor-but-not-radical improvement.
The major difference I see is that this is a sort of `/x` modified regex with better readability. This is a significant improvement - the big problem with `/x` is that one then needs to escape/encode whitespace, which typically offsets the improvement.
The other significant change is variables; however, once one reaches a certain complexity, IMO they should not use regexes anymore.
The other differences (based on the home page presentation) are cosmetic changes of the same semantics.
The fundamental assumption is that the main problem of regexes is about syntax rather than semantics. It's true that regexes are somewhat ugly and hard to remember, but, personally, I don't think that's the main problem. It would probably have been nice if regexes had started with this syntax, but integrating it into a new project is a difficult decision.
Are you honestly implying that there are still people who, in all seriousness, use a REGEX to parse HTML?
Every programmer I have met, however junior, knows or learns about Zalgo early on. It's like the "foobar" wrt regexps: common knowledge.
True, knowing it's a bad idea won't hold people from doing it. I do it. I'll whip up some stupid `:%s/\<?p class="strong"?>/<strong>/g` occasionally in vim, knowing perfectly fine that this regexp won't be solid, but: works for this case. All the while knowing about Zalgo for years (decades?)
Yes, but the specific problem is that it's not general knowledge exactly why you shouldn't attempt it.
Ask a handful of developers and see how many mention Chomsky Type 3 grammar, or even regular grammar.
And unless you scope an understanding of what regexes are to that, which is non-trivial, you get some bizarre feature requests, because people are attempting to use it inappropriately.
From a quick glance Rulex looks very similar to Eggex!
A difference is that Eggex is embedded in a shell so you can use normal assignment statements to build up subpatterns. And you can also interpolate directly into an 'egrep' or 'awk' command.
Funny thing is that I designed a similar project called "Annex" in Python around 2014.
But when I looked back on it after a few years with fresh eyes, it was just too different! It was hard to remember.
So Eggex is very similar to the common syntax -- it's actually roughly the same as lex or re2c. I think there's a case to make that we ALREADY have a good syntax for regexes in lexer generators -- it's just the POSIX/Perl syntax that is bad!
The other thing about Annex is that it was a Python library, and it was never worth the extra dependency. (IMO PyPI has a lot of friction.)
That is, after writing it, I already knew stdlib re syntax so well it wasn't worth it :) But Eggex is in a shell, which is a context that makes a lot of sense IMO.
The first one runs code in the middle of the regex to see if it matches a mathematical property; the second runs code to generate a [0|1|2|3|...|254|255] equivalent, which also matches as tightly as the rulex/regex code (e.g. no 00 allowed)
None of those seem much easier to understand than the others. I suspect this is a classic example of "let's take something that is really messy in PCRE and compare it to other, briefer equivalents", but I could easily turn the regex into something more readable, and I can find more help for the standard regex syntax than I can for any alternative.
As others have said, it is a minor, arguably cosmetic improvement, and it's likely easier just to use a visual helper like regex101 to help you construct the expression.
It can end up taking a little longer to write (same with all of these alternate syntaxes), but I think it significantly reduces the WTF-factor when you have to try and read it again in weeks/months/years.
I think it's worth it for any non-trivial regexp and I miss it whenever I'm using any other language, so I'm glad that this thread is full of alternatives :)
Not OP, but I have switched to using the rx macro for everything and I never miss regex; instead I miss rx everywhere else. Here are a couple of blog posts someone might find useful:
I haven't used rx in Elisp specifically, but in my work in R, I've started using the rex package almost exclusively for writing regexes. It's the same idea: you build a regex programmatically instead of as a string, so you no longer need to think about syntax, only semantics, because the syntax is the same as the host language.
Just looked at Rulex briefly... am I the only one that thinks it isn't obviously more readable or easy to understand? Maybe it's because I've taken the time to master regex, but this syntax seems just as opaque as terse regex syntax was to me prior to digging in.
You are not the only one. REs look like gibberish until you understand them. To me Rulex looks like gibberish, but the REs they turn into make sense. I don't think the syntax is inherently more readable, and the whitespace insensitivity + comments feature has been in Perl and Ruby REs for decades.
I think the main appeal is its ability to transpile to different languages, so that you can use whitespace insensitivity + comments in other languages where those features are not supported.
I really like this. One big benefit of this notation, that I feel is under-highlighted in both this comment page and the Rulex site, is that there are no backslashes in Rulex.
A lot of regex usage is inside strings inside another programming language, which typically means double backslash escape messes such as \\\\" and it's just a mess. This makes regexes particularly beginner-unfriendly, because you gotta parse it out at different levels.
No backslashes means that you can put a Rulex expression in a string in just about any language, not escape anything, and still get a predictable result.
I suspect that a lot of its design actually follows from that requirement. Eg having to quote all character literals/sequences - that's needed because that way, eg a newline can just be "[n]", even inside a character class. (eg ['a'-'z' n]).
Also, it looks like you can choose to use either single-quotes or double-quotes. This means that if you use the one style of quotes in your language, you can use the other one in the Rulex expression and still not have to backslash-escape anything.
This has an almost uncanny-valley effect, in that yes, this is an improved syntax for regular expression, but that has the effect of highlighting why PEGs are the way forward here.
It's great that you can assign variables in Rulex, but they can't reference themselves recursively, so you're still locked out of all the interesting formats. Meanwhile it weakens the "ok but everyone knows how to use regex, it's in the language, and we're just doing a bit of data munging" argument for choosing regex, and you know what? That's a good argument, and it might be the only one.
Finally someone else trying to fix regex syntax approaches it from the same angle as mine[1]: a portable syntax that transpiles to plain old regex to get bug-for-bug compatibility but a nicer syntax.
Instinctively, I think the syntax didn't go far enough and could be made more readable (of course I'd feel that way!), but the implementation and marketing has me beaten! Kudos!
1. if you know regex for one particular language, this takes care of having to learn the different variants of regex, because it will compile down to the version you want.
2. it is a more approachable syntax, so easier to learn.
3. because it is still a regex syntax given that you probably know a bit of regex already you can get better at it by learning this.
Note that rulex replaces the named backreference with a numeric one. Named backreferences are much easier to work with, because you don't have to count capturing groups, which is error-prone and complicates later modifications.
A more accurate translation of the above into a (PCRE) regex would be
(\s)(?P<named>.)\k<named>
However, named backreferences aren't supported in JavaScript. Rulex makes sure that the emitted regex is correct and supported by the target regex flavor.
After looking at all the examples nothing stands out as particularly appealing. It ends up being just a more verbose version of regular expressions, but with less clear semantics, harder to see the underlying constructs. Over the years I've become quite familiar with regexp so maybe I'm just biased, but I'd rather have something like CoffeeScript's block expressions instead, where you can easily group and document each part:
> In Rulex, whitespace is insignificant, except between quotes. This means that we can add spaces and line breaks to make the code look clearer. We can also add comments to explain what the expressions are doing. They start with a # and span until the end of the line:
Ok, despite being more in the "now you have two problems" camp when discussing regular expressions, I find this interesting enough that I should schedule learning it. Question - when compiling, I suppose it can turn the regex into the most optimized form - so generation of character classes, and lazy quantifiers if pertinent?
// A regex for extracting a currency (dollars or pounds) and amount from input
// with precisely the form /[$£]\d+\.\d{2}/
let regex = Regex {
Capture { /[$£]/ }
TryCapture {
/\d+/
"."
/\d{2}/
} transform: {
Amount(twoDecimalPlaces: $0)
}
}
That gives you control over the verbosity.
Aside: all of this is nice, but does anybody know whether compilation speed (which can be notoriously bad) improved in the latest Swift beta?
The design of the examples is weird and unintuitive. At the very least should have some sort of clarifying header of “new” and “old” or something. It took me a solid 30 seconds of confusion to realize the regex was the old syntax rather than the new syntax.
The arrow points from left to right, and arrows represent progress but that’s backward from their intent. The new syntax should be on the end of the arrow, not the beginning.
If this were a regex transcoder and the base were the input syntax and the tip were the output code it could kinda make sense, but best I can tell that is not the case here.
The only part that I can agree to is that it's useful to compose a larger regex from smaller easier-to-understand ones.
But I don't see why a new regex engine is required for that. On the other hand I'm probably overlooking some important features that they bring to the table. Maybe I'll come back with more comments if I invest more time looking at Rulex.
Could a way to list examples be added to this? I guess it might be possible just with comments and a preprocessor, but it would be great to have a way to specify a list of matches and non-matches and know that the regex works with them at 'compile' time. It would probably work as inline documentation too.
And is there a standard file extension and syntax colouring available for editors (yet)?
When compiling a rulex to a regex, you specify the regex flavor, so the resulting regex is compatible with that regex flavor. This is significant because PCRE, .NET, Ruby, Python, JavaScript, etc. all use slightly different syntax, but you can use the same rulex to target all these different regex engines.
Looking at the examples, I can't see where the improvement is, tbh. Both classical regex and this new kid equally require training to read, because they are both terse. Adding whitespace and comments doesn't bring novelty - using your PL of choice's syntax you can already split a regex into smaller chunks for readability.
- Regular Expressions - very popular, occasionally reads like line noise, backslash for escape
(?:What is your (?:name|quest|favourite colour)\?\s?){1,3}
- https://github.com/SonOfLilit/kleenexp - Terse, readable, almost never needs escaping. Python compiler, almost ready VSCode extension, online playground.
[1-3 'What is your ' ['name' | 'quest' | 'favourite colour'] '?' [0-1 #space]]
- https://github.com/yoav-lavi/melody - More verbose, supports macros, backslash escapes only for quotes. Rust compiler, babel plugin. Improves with time, getting quite impressive.
1 to 3 of match {
"What is your ";
either {
"name";
"quest";
"favorite color";
}
"?";
0 to 1 of " ";
}
- https://rulex-rs.github.io/ - Very similar to legacy regex syntax, supports macros and number ranges, supports unicode, _amazing_ error messages help convert legacy to new syntax, backslash escapes only for quotes. Rust compiler, as of today no built in way to use outside rust (but they seem to be planning it).
('What is your ' ('name'|'quest'|'favorite colour')'?' [s]){1,3}
- https://www.oilshell.org/release/latest/doc/eggex.html Part of a new Unix shell's syntax. Big on composition (macros in kleenexp). Uses backslash for character classes. Production-ready within the shell, not supported elsewhere yet.
/ 'What is your ' ('name' | 'quest' | 'favorite color') '?' ' '? /
- http://verbalexpressions.github.io/ - Embedded DSL, supports 14(!) languages (to some extent? I didn't verify), but don't seem to have syntax for `(a|b)` and `(?:...){3}`
const tester = VerEx()
.then('What is your ')
.either( // this doesn't seem to be implemented yet (?), so I'm proposing a possible syntax
VerEx().then('name'),
VerEx().then('quest'),
VerEx().then('favorite color'),
)
.then('?')
.maybe(' ')
.multiple(3); // not sure this is the correct syntax or how to use it in more complex scenarios, hard to tell from tests and discussions in Issues
- There are many more eDSLs, but I will not list them as they are less relevant in my opinion
If you want to see something really neat check out Prolog DCGs:
> A Prolog definite clause grammar (DCG) describes a sequence. Operationally, DCGs can be used to parse, generate, complete and check sequences manifested as lists.
This project looks great.
Occasionally I need to write a reg expr (JS developer) and I am not that good at it and I always end up googling around.
I am thinking it would be much nicer if I could use rulex to generate my reg expr.
But is that what it is supposed to be for ?
cheers
Yes, rulex is compiled to a regex. You can use it with the CLI or the online playground and insert the result into your source code. Eventually, I'd like to have a babel plugin that can transform a rulex`...` literal to the equivalent regex, but we're not there yet.
Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one. Regex is fine. Been using it nearly daily for the last decade. You get used to it. These days, there are tools that make it incredibly easy to write, test, version and document regex, like https://regex101.com/.
It's all too clear that they have mastered the regular expression syntax, because they explain their own in terms of it. And I for one have a hard time reading it.
Regex is not fine. It's been recognized as a source of bugs for decades. Fredrik Lundh's famous quote: "Some people, when confronted with a problem, think ''I know, I'll use regular expressions!''. Now they have two problems."
Claiming that nothing needs fixing here is a lot like claiming we don't need Rust because the C preprocessor does all we need, or that we don't need any fancy build tools because we have GNU autotools. I hardly see that sort of comment anymore anywhere but Slashdot.
The issue is that ideally you use regex as little as possible anyway, so it's not worth the effort of fixing it and pushing for something better.
The main problem with this proposal is that they use almost no time explaining why they've made the changes they made, and even to understand WHAT the changes are you have to be good at reading already existing regexes (which is harder than writing them!)
That said, one change I spot which I heartily approve of, is reversing the quoting assumption. In regular regex, strings by default match themselves, and need to be escaped to get special meanings. This is a poor design because if you reach for regexes in the first place, it's the special meanings which are important, not the literal matches. Even the first change to regex (egrep) partially recognized this, by reversing the escaping assumption on ?, +, | and parentheses. Going all the way would have been the better decision, and I'm glad to see the authors of this project agree with me on that.
> "Some people, when confronted with a problem, think ''I know, I'll use regular expressions!''. Now they have two problems."
This is a sort of funny tongue-in-cheek jab, not a factual statement. I'd chuckle if I heard someone say it in real life, but also eye-roll internally. There are some people who will write lines of substr code to trim whitespace from a string. That is the type of person I would expect to crack this joke.
Regular expressions are fine, just like an oil filter wrench is fine. I wouldn't use an oil filter wrench to tighten a nut/bolt. I wouldn't write a single regular expression to validate an email address or implement a lexer.
While I respect the effort put into making this project, I see it mostly as a distraction for new developers. If you are confused by regular expressions, you may be trying to do too much with them and/or you need to put some effort into mastering them (I recommend Jeffrey Friedl's Mastering Regular Expressions).
If you try to pretend regular expressions don't exist, you are going to have a bad time. You will encounter them, unless you never touch existing code during your tenure.
I also don't think this syntax is better. Regular expressions are terse, and that is a feature. It may feel like an anti-feature in larger expressions, but the secret is that if you are writing huge regular expressions, you are probably doing something wrong.
I think the point of that quote is less about error prone regexes and more about regexes not being the right tool for many parsing problems. A new regex language doesn't really tackle that right?
> Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one
Wow, that's unnecessarily inflammatory.
This is the first I've ever heard of Rulex, yet it takes no more than a glance at the homepage to realise the author would have more mastery of regex than 99.99% of developers.
Rulex transpiles to regex, running in a variety of engines. Meaning the author has not just mastered one regex dialect, but the subtleties of all supported regex engines (PCRE, JavaScript, Java, .NET, Python, Ruby and Rust).
> Rulex transpiles to regex, running in a variety of engines. Meaning the author has not just mastered one regex dialect, but the subtleties of all supported regex engines
Not necessarily. More likely, Rulex supports the common subset of features of those engines, with small syntactic differences in the output where necessary, ignoring many subtleties of the implementations.
It's a middle ground, actually: When compiling a rulex, you specify the target regex flavor. Some features (e.g. lookaround and backreferences) aren't supported in every flavor. Some features are polyfilled by rulex, so they work even when the targeted regex engine doesn't. However, there are also features supported by PCRE that aren't available in rulex yet. Notably, atomic groups, recursion, character class intersection/subtraction and some mode modifiers aren't supported yet. Eventually I'd like to support all of PCRE's features.
>Got to admire someone who is going to invest time to invent a whole new language just to avoid mastering an existing one.
Yes, because any existing language is the global optimum, and we should always strive to reuse it, never invent any new one...
I mean, what's the problem with caveman-era stone cart wheels? Why reinvent the wheel?
Not to mention the implication that the only reason one would try to invent a new language is to "not invest time mastering an existing one". I.e. "It's not that this guy is trying to improve regular expression languages - it's that he's too dumb/lazy to master standard regular expressions".
Never mind that this new language also transpiles into regular regex, so clearly the person who created it has also "mastered the existing one".
Working with a company that writes hundreds of regexes per day, I can say with confidence: regex syntax is _not_ fine. People with lots of experience make obvious mistakes and lose them in the line noise all the time.
1. Very steep learning curve for beginners, especially if they have to read a regex they didn't write themselves
2. Write-only language, just like APL: dense and concise, but a giant PITA to read once you've walked away from a complex regex for a couple of days.
3. Not standardized (sed regex vs. sed extended regex vs. perl regex vs. vim regex vs. bash regex vs. globbing vs. whatever else, it's never the same, what a giant effing PITA)
4. No proper integrated debugger (there might be some 3rd-party tools out there, but they're never around when you need them, as in: integrated in whichever environment you are currently writing regexes in). When your large and complex regex fails, good effing luck trying to figure out why, short of breaking it down into tiny pieces and debugging each in turn recursively. You don't even have access to a bloody printf.
5. Boolean trees of regexes (match blah OR blech AND NOT blargh, where blah, blech and blargh are themselves boolean trees of regexes) are a giant PITA to express within the language (a sketch of the usual lookahead workaround follows this comment).
6. Overlap and confusion between capturing with '()' and merely grouping an expression to feed into an alternation.
7. No way to factor out common and/or repeating parts, like you would in a normal programming language.
So any attempt at improving this mess is welcome in my book.
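To illustrate point 5 with a rough sketch of my own (Python purely for illustration): the standard workaround is to stack lookaheads, and the boolean structure gets buried in the syntax.

    import re

    # "(blah OR blech) AND NOT blargh" forced into one pattern: the OR stays an
    # alternation inside a lookahead, and the NOT becomes a negative lookahead.
    pattern = re.compile(r"^(?=.*(?:blah|blech))(?!.*blargh).*$", re.S)

    assert pattern.search("xx blech yy")
    assert not pattern.search("blah and blargh")

    # The same logic expressed outside the regex language reads like what it means:
    def matches(text: str) -> bool:
        return ("blah" in text or "blech" in text) and "blargh" not in text

    assert matches("xx blech yy")
    assert not matches("blah and blargh")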
> So any attempt at improving this mess is welcome in my book.
Already done, I think. :)
2. Not a big problem if the responsible programmer employs named captures and named subpatterns, switches on extended mode, and makes use of whitespace and comments. As with any other programming environment, it's a matter of wanting to write readable code, and perhaps enforcing this through technical and social means. (A short sketch of what this looks like follows this list.)
3. I observe that Perl-style regex has become the de facto standard, in the sense that numerous programming languages in common use have adopted it, and preceding styles are marginalised to the point that any discussion of regex that doesn't specify a style can be assumed to mean Perl-style.
The next points are actually implementation dependent. It is not fair to condemn regex as a whole just because some implementations fall short.
4. Perl has good debugging tools for regex.
5. This is easy to express in Perl.
6. Non-capturing grouping exists.
7. Named subpatterns exist, interpolation of variables containing regex values exists.
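For what it's worth, these facilities exist well beyond Perl too. Here is a small sketch of my own (Python's re module, not anything from the thread) combining named captures, extended/verbose mode, and fragments factored out into ordinary variables, which covers points 2, 6 and 7:

    import re

    # Reusable fragments are just ordinary variables...
    YEAR  = r"(?P<year>\d{4})"
    MONTH = r"(?P<month>0[1-9]|1[0-2])"
    DAY   = r"(?P<day>0[1-9]|[12]\d|3[01])"

    # ...composed into a larger pattern, written in verbose mode so it can be
    # laid out and commented like normal code.
    ISO_DATE = re.compile(
        rf"""
        ^
        {YEAR} - {MONTH} - {DAY}   # YYYY-MM-DD
        $
        """,
        re.VERBOSE,
    )

    m = ISO_DATE.match("2022-06-19")
    assert m and m.group("month") == "06"

    # Non-capturing groups (?: ... ) keep grouping separate from capturing.
    assert re.fullmatch(r"(?:foo|bar)+", "foobarfoo")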
Perl is a dying language. Your 4-5 "solutions" are non-solutions in practice for 99% of software developers out there.
And regexps are by default written in the concise form. It's the standard problem of "opt-in" vs "opt-out".
Opt-in, as a rule of thumb, has low adoption rates, for various reasons (maybe 20% if you get lucky).
Opt-out is the inverse: it has adoption rates of 90%+, except for total, universally known disasters.
Tools which are unsafe by default are bad.
Yes, the craftsmen are bad, but the way we improve is by also improving our tools. It doesn't matter how much pointing at the idiots we do: practical regexps have been around since ~1975, almost 50 years, and in industrial use they haven't gotten safer.
I'd agree that we get used to regex with usage, but need to be extra careful to test all the corner cases (since by design, a pattern matches multiple things).
Creating rulex has been quite a journey, during which I learned a lot about regex syntax, its quirks, and the differences between various regex flavors. I already knew a lot about regexes before I started this project, but creating a language that compiles to regexes requires a much deeper and more thorough understanding of them. I didn't create rulex so I don't have to master regexes. I created it so that people who use it don't have to master regexes -- a language that is riddled with poor design choices. Yes, you can get used to it, just like you can get used to working with a dull plastic knife. The fact that people need a website such as regex101 or regexr just shows, in my humble opinion, that regular expressions aren't as straightforward or easy to understand as they could be.
I want to allow users to create and share regexes on my website, but I'm afraid there could be some abuse, like building a very complex one that would get JavaScript stuck or crash. Are there any known vulnerabilities, and is there any way to sanitize regexes?
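The usual concern here is "ReDoS" (regular expression denial of service) via catastrophic backtracking: JavaScript's built-in engine is a backtracker, so a nested-quantifier pattern matched against input that almost matches can take exponential time. A rough sketch of the shape of the problem, shown in Python (whose re module backtracks the same way; sizes and timings are illustrative only):

    import re
    import time

    # Nested quantifiers plus a required trailing character that never appears:
    # the engine tries exponentially many ways to split the run of a's.
    evil = re.compile(r"^(a+)+$")
    subject = "a" * 22 + "!"

    start = time.perf_counter()
    evil.match(subject)            # fails, but only after heavy backtracking
    print(f"no match, took {time.perf_counter() - start:.2f}s")
    # Each extra 'a' roughly doubles the time.

There's no reliable way to "sanitize" arbitrary patterns by inspecting them; the usual mitigations are to cap pattern and input length, run matching off the main thread with a timeout, or use a linear-time engine such as RE2.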
Regexes often contain complex parts that repeat themselves throughout the regex or can be combined into higher level patterns. This adds support for that.
Looks like it would be good if you had to use some ancient regex engine.
Otherwise, I find regex is extremely easy, since the problems it solves well are very simple, and one can develop it interactively by trial and error, by looking at the highlights in their editor.
I could possibly see actually using this if the playground had a similar "Try it on real text instantly" feature, but I've never had a problem with traditional regex+trial and error.
you may be resigned to trial and error being an essential part of regex development, but the reliance on trial and error should be seen as (1) a signal that something can be improved in how programmers write regexes, and (2) a symptom of a programming ethos in which essential things are expected to be difficult.
I'm glad that this project (and others like it that have been highlighted here) allows us to improve on (1), and I am always encouraging my professional peers in computing to reconsider (2).