The Greatest Regex Trick Ever (2014) (rexegg.com)
582 points by signa11 on July 8, 2021 | 191 comments



Let me take these PhD-level regexes down to elementary school awesome.

I have a process table and I want to grep it for the phrase "banana":

ps auxww | grep banana

root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana

mikec 456 450PM 0:00.00 grep banana

Argh! It also greps for the grep for banana! Annoying!

Well, I'm sure there's pgrep or some clever thing, but my coworker showed me this and it took me a few minutes to realize how it works:

ps auxww | grep [b]anana

root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana

Doc Brown spoke to me: "You're just not thinking fourth dimensionally!" Like Marty, I have a real problem with that. But don't you see: [b]anana matches banana but it doesn't match 'grep [b]anana' as a raw string. And so I get only the process I wanted!


This almost always works, but it won't if the shell expands your bracketed letter. See for example:

    $ echo [b]anana
    [b]anana
    $ touch banana
    $ echo [b]anana
    banana
You can escape the bracket and it will work:

    $ echo \[b]anana
    [b]anana


I tested with zsh; apparently even if there are no matches it still complains.

Escaping works under zsh. My preferred method is single quotes:

    echo '[b]anana'


I remember reading about this trick 20-some years ago, but it's still as good now as it was then.


Which zsh version? On 5.7.1, echo \[b]anana works in both cases


This is really clever... I usually end up adding

  | grep -v grep
like in

  ps auxww | grep banana | grep -v grep


this does tend to play havoc with $? values if you're used to using those to test grep's results.


You can always just reverse the greps.


the simple things in life elude me. holy cow. i learned the "tag a grep -v grep" at the end and never looked into refining it. but just flipping the greps? nope, not once did that ever occur to me. thanks


That's a valid concern, but in this case it won't cause problems -- grep exits with 0 if a line was selected; in the case of `grep -v grep` that means there was a line without "grep", which is what we want.

(Also @thewakalix made a good suggestion to reverse the greps.)


except, your grep -v portion will always return a result so $? will also always return 0. even if your grep banana did not find anything.

the reversing the greps will be my new default behavior


Nope, try it out:

  $ printf 'banana\ngrep banana\n' | grep banana | grep -v grep
  banana
  $ echo $?
  0
  $ printf 'grep banana\n' | grep banana | grep -v grep
  $ echo $?
  1
To clarify my previous comment, `grep -v grep` exits with 0 if there was a line without "grep" in the output of `grep banana`.


In bash, you can get the rc of piped things, the var eludes me while walking..


${PIPESTATUS[@]} for a space delimited list of all exit codes, or replace @ with the position in the pipe chain of the specific command you want.

ps auxwww | grep banana | grep -v grep && echo ${PIPESTATUS[1]}

type of thing


You can also set -o pipefail. The first non-zero exit code is returned if there is one.


wow. and here i've been ps auxwww | grep banana | grep -v grep all this time

edit: saw someone else posted this as well. should have known


Messes with the highlighting, you need to tag another | grep banana on the end ;)


But what's wrong with pgrep -f though? I don't want to search for a clever trick every time I need to grep a process.


pgrep is great, but note that you can still encounter this problem if you run pgrep in parallel -- it will never match its own process, but it will match other pgrep processes.

So for example if you have a script that uses `pgrep -f banana` to search for a "banana" process, and you run that script twice in parallel, pgrep might see the other pgrep process and think "banana" is running even though it isn't.

I was bitten by this :)


I use pgrep -laf the-wanted-string https://man7.org/linux/man-pages/man1/pgrep.1.html

But nice regex though

Edit : someone already posted that solution https://news.ycombinator.com/item?id=27777901


    ps auxww|sed -n '/ doesnotexist /d;/banana/p'
When/if grep is not available


What systems don’t have grep? Off the top of my head, anything vaguely POSIX compliant would have it, and anything descended from 4th Edition Unix would also have it, which seems like it would cover everything.


"What systems don't have grep?"

Embedded systems are one answer.

Toolchains for compiling systems from source is another answer, e.g., NetBSD's toolchain has sed, but not grep.

A third answer is install media. For example, NetBSD install kernels have ramdisks with sed but not grep.

A fourth answer is personal, customised systems. I create small systems that run from RAM. I run these on small computers with limited resources. When one of these computers first boots up, it may not have a "full" set of userland programs. It may not have grep. I am not inclined to use the limited space available to include grep at such an early stage if I can get by with sed.

Hope this answers your question.


The most common cases today are probably Docker containers, which are often based on the most minimal image possible. Alpine doesn't have grep iirc.


> Hope this answers your question.

It does, thanks for the detailed response.


Maybe someone deleted grep by accident? Or maybe you accidentally traveled in time to the 70's and need to use a nearby PDP-11 to calculate a way home? Lots of possibilities.


I give up. I have to know why this works. Please, tell me.


The regex `[b]anana` matches the string "banana". But it does not match the literal string `[b]anana` --- to match that, you'd need something like `\[b\]anana`. The literal string `[b]anana` is what shows up in the process table for grep, and so it doesn't match.
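You can verify that self-miss with any regex engine; here's a quick Python sketch (the strings are made up):

```python
import re

# The class [b] matches exactly the letter b, so the pattern
# still matches the word "banana"...
assert re.search(r"[b]anana", "FruitProcessor --core=banana")

# ...but the pattern's own literal text, as it appears in the
# process table, contains "[b]anana" rather than "banana":
assert not re.search(r"[b]anana", "grep [b]anana")
```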


Ooooooh, that's brilliant. You've made the self reference no longer a self reference.


This is really cool. This should also work:

    ps auxww | grep b\\anana


_applause_

Never thought of that. Nice.


The more general tip is that a single regex isn't the only tool you have. You don't have to get your final product in one step. Almost every "disaster" regex comes from someone trying to do too much at once.

One other solution would have been to run the regex twice, once to pick up all instances of Tarzan, and a second on the results of the first to filter out all instances of "Tarzan".


This is also often the problem with disaster SQL queries. I've seen some monsters that got hopelessly tangled up in their own JOIN constraints, trying to fetch all the data in one roundtrip because OMG LATENCY but then having to do full table scans over large-ish tables instead. Rewriting it as three small indexed queries reduced the runtime from 40 minutes (!) to less than a second.

Don't do too much in one operation whether it's regexes, SQL queries or OOP classes!


Do you have an example of such a query? Maybe Common Table Expressions would have been enough instead of multiple roundtrips.


I’ve used the pattern several times of “select these SQL objects into a cursor, then iterate over the cursor to assign/revoke/check permissions on the objects”.

It’s still in a single batch of SQL (stored procedure in our case, so no additional network roundtrips), but the code is vastly clearer to read/maintain this way.


In which cases are row-by-row loops clearer than set-based sql?


They're clearer in the sense that it makes it easier/possible to do things like:

While maintaining/changing the SQL, comment in/out select-statements-as-printf-debugging, and comment in/out actual execution of the statements themselves.

These cursors would often contain [identifying object reference], [category of statement], [text of SQL statement to execute]. You would write a select statement to populate the cursor, then loop over the cursor to run all the statements in the order you wanted (drops, then user/role creates, then grants, or whatever the situation called for).

It's not about logical clarity, but practical maintainability given the (overall weak) state of tooling for database queries. Is it a bastardization of SQL to do something that "should be" done in another scripting language? Maybe, but there's a lot of power in giving the DBAs tooling that works exclusively in a language and environment that's familiar for them rather than splitting it across SQL and python/tcl/ruby/whatever. Not nearly every competent [relational] DBA is competent across multiple languages. Every competent [relational] DBA is competent in SQL.

Is it even possible to use set-based SQL to call EXEC SQL EXECUTE IMMEDIATE or sp_executesql on each statement in a set?


I’m pretty decent with regex, but I often break complex regexes down into multiple steps for better clarity and easier debugging. Sure, you can use extremely clever one-liners, but the next maintainer of your code may hunt you down and murder you on the spot for wasting weeks of their time.


To your point (and how to fix it in a way that seemingly nobody does): you can make complicated regular expressions pretty simple by using named groups and ignoring pattern whitespace, because they allow you to logically separate different components and specify intent. Nobody would ask a fellow dev to debug JavaScript where it is all on a single line and every variable name is a1, a2, etc. Except people do it all the time with regex. It's insane. Hell, you don't even get the blessing of a1, a2, a3. It is all unnamed. Insanity.

Some rare people can figure out:

\d{1,2}[-/]\d{1,2}[-/](\d{4}|\d{2})

but a dummy can figure out this:

(?<month> \d{1,2} ) [-/] (?<day> \d{1,2} ) [-/] (?<year> \d{4} | \d{2} )
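In Python, that whitespace-tolerant layout is spelled with re.VERBOSE, and named groups use (?P<name>...); a sketch of the same date pattern:

```python
import re

date = re.compile(r"""
    (?P<month> \d{1,2} ) [-/]
    (?P<day>   \d{1,2} ) [-/]
    (?P<year>  \d{4} | \d{2} )
""", re.VERBOSE)

m = date.search("shipped 7/04/2021")
# m["month"] == "7", m["day"] == "04", m["year"] == "2021"
```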


And, as always, the "next maintainer" is most likely your future self. So it's really nice to give your future self the gift of clarity, then look back later and say "Thanks, past me!"


A big source of trying to do too much is environments that offer easy regex-based transformations defined as a pair of a regex and a single replacement string (which may contain references to matching groups), while making other transformations hard ("while find + rest"). When you have the option to provide a "process match" closure instead of the replacement string, the lure of putting too much into a single regex almost collapses.


Caveats apply. A regular expression isn't just a way of saving yourself a few lines of explicit string manipulation. It's describing a state machine that does these text operations efficiently (in some programming environments, that state machine will get optimized and compiled down to metal prior to first use).

If you're matching a couple short strings, sure, don't bother overthinking the regex. If you're matching a lot of them, and/or they're long, then the extra time spent on making a single regex work will be worth it. The regex will work smarter than your hand-rolled code, and it also won't waste memory returning partial results.

Also: in my experience, almost every "disaster" regex comes from people not bothering to document and test what they write.


This is the correct answer. Be less clever. Makes life much simpler for whoever has to maintain your code (which may well be you).


I got the feeling that a lot of those »I have to do this in a single regex« questions come from places where a single regex is basically the only API you have available. Something like form input validation where the framework provides a handy regex it uses for validation, but doesn't expose actual validation callbacks or events to do the same in code without having to redo everything around it. It's only a hunch, but when I have the opportunity to use code to validate a string I probably wouldn't assume that code to be a mandatory one-liner, even as a beginner developer.


This trick may be thought of as a simplification of the systematic approach to parsing stuff, that is the lexer-parser division of responsibilities.

The lexer uses regexes but only for splitting the input stream of characters into tokens. Identifiers, integers, operators, strings, keywords, opening brackets and whatnot - each type of token is defined by a regex. This part is hopefully deterministic and simple, although the lexer matches regexes for all kinds of tokens at once, which is why lexer generators are often used to generate lexers.

The heavy lifting is done by the actual parser which tries to combine the tokens into something that makes sense from the point of the grammar.

So in this trick the sub-regexes between |'s define the tokens (the lexer part) while the group mechanism selects the single token that we want to keep (a very very simple parser).
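That lexer half can be sketched in a few lines of Python; the token set here is invented for illustration, and m.lastgroup names the alternate that matched (the "keep this token" trick in miniature):

```python
import re

# One alternate per token type; whichever named group matched
# tells us the kind of token we just read.
TOKEN = re.compile(r"(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[+\-*/])|(?P<WS>\s+)")

def tokenize(text):
    for m in TOKEN.finditer(text):
        if m.lastgroup != "WS":   # drop whitespace tokens
            yield m.lastgroup, m.group()

# list(tokenize("x1 + 42")) → [('NAME', 'x1'), ('OP', '+'), ('NUMBER', '42')]
```

A real lexer would also reject input matched by none of the alternates; finditer just skips it.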


The (?...) syntax group has to be the most unmemorable of the bunch. I've used it maybe over 1,000 times or so and I still have to look up ?: or ?! or ?< or whatever else.

I used to have a laminated sheet on my wall at an office because it was so terribly bad.


Sub-expression or capture group: (foo)

Named capture group: (?<name>foo)

Non-capturing group: (?:foo)

Lookahead: (?=foo)

For negative lookahead, change = to !: (?!foo)

For lookbehind, add <: (?<=foo)

For negative lookbehind, change = to !: (?<!foo)

(Not from memory, had to look everything up...)
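A quick sanity check of the list above in Python (which spells named groups (?P<name>...)):

```python
import re

s = 'Jane said "Tarzan", then Tarzan replied'

# Lookarounds assert context without consuming characters:
assert re.search(r'(?<=")Tarzan', s)              # preceded by a quote
assert re.search(r'(?<!")Tarzan(?!")', s)         # no quotes on either side
assert not re.search(r'(?<!")Tarzan(?!")', '"Tarzan"')
```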


> (Not from memory, had to look everything up...)

right, great list but I'll forget it all by tomorrow.


I'm not sure if any regex library exposes this, but since regular languages are closed over compliment and intersection you could theoretically do something like match("....string..", regex("Tarzan") - regex("\"Tarzan\"")), where the - operation is shorthand for intersection with the compliment. Does anyone know if any regex libraries expose these sorts of operations on the regular expression/underlying DFA?


Intersection with the complement will not work here.

Because the idea is to match Tarzan, but only if it is not preceded and followed by a quote.

Regex intersection and complement do not perform look-behind or trailing context.

Live demo:

  This is the TXR Lisp interactive listener of TXR 265.
  Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
  TXR may be used in areas that are not necessarily well   ventilated.
  1> [#/Tarzan&~"Tarzan"/ "Jane shouted, \"Tarzan\""]
  "Tarzan"
  2> [#/Tarzan&~"Tarzan"/ "Jane shouted, \"Tarzan!\""]
  "Tarzan"
The &~"Tarzan" makes absolutely no difference. The reason is that Tarzan matches exactly one string. The complement ~"Tarzan" matches a whole countable infinity of strings, and one of those is Tarzan. The intersection of that infinity and Tarzan is therefore Tarzan.

Intersection with complement is useful like this:

Search for a three-character substring that is not cat:

  3> [#/...&~cat/ "hat"]
  "hat"
  4> [#/...&~cat/ "dog"]
  "dog"
  5> [#/...&~cat/ "doggy"]
  "dog"
  6> [#/...&~cat/ "cat"]
  nil
  7> [#/...&~cat/ "scatter"]
  "sca"
  8> [#/...&~cat/ "catalan"] ;; "cat" is skipped, then "ata" works.
  "ata"


Greenery (python3) lets you manipulate regular expressions and do things like compute intersections: https://github.com/qntm/greenery


This is exactly the type of thing I was thinking of, and seems quite fully featured - thank you!


Wouldn't that end up just being the same as 'regex(Tarzan)'? Those regexes can't match the same thing, they can only overlap.

What you want is something like all matches of regex("Tarzan") not contained in a match for regex("\"Tarzan\""), which is a bit trickier. That would require something like:

regex("Tarzan") - all-substrings(regex("\"Tarzan\""))

and I'm not sure regular languages are closed over the "all-substrings" operation. Actually I'm pretty sure they aren't.


Unfortunately (or perhaps fortunately), “regexes” as commonly implemented in programming languages are only loosely related to regular expressions from automata theory. With all their extensions, they can recognize much, much more than just regular languages, and I don’t think they’re closed under complement (though I’m not sure). However, most regex engines have a feature called negative lookahead assertions, (?!do not match), which would almost work in the way you suggest.

You have to be careful about inputs like this though: “Inside a string”Tarzan”Again inside a string”


Yeah, a DFA that recognizes a regular language can easily be implemented with O(n) worst case behavior.

My attitude is generally that one should use regexes for matching regular languages and if one needs a stack or even Turing completeness then handle that in code around the regex.


> compliment

I’ll take that as a complement.


"It's a complement... NOT" - Borat.


Not exactly that but take a look at https://github.com/mtrencseni/rxe ("literate regex"). I found this on HN and recall the comment thread being good but I can't find it now.


This perhaps? second result on hn.algolia.com. https://news.ycombinator.com/item?id=20646174


I kinda love that this is written like an infomercial.... for regexes. I've gotten far enough to think, oh, I think I actually just had this problem recently and didn't know how to solve it with regexps, but I'm still not to the part that actually tells me the One Regex Trick, and I'm still reading!


This article is kind of a bait and switch actually. It first states:

>we want to match Tarzan except when this exact word is in double-quotes

and so the reader might start thinking of ways to "match" this. The author then starts to mention ways to do this, but in the end their trick is actually not to "match" it, but to remember it in a capture group. This will not match what the author says it will match: if you do regex.test(string), it will return true whenever "Tarzan" appears, because it is one of the alternates.

It appears the author is very good at story-telling though.


In the very first section (albeit after from introduction, but still before the Tarzan example you show), it clearly states the limitations:

> Before we proceed, I should point out some limitations of the technique:

The author clearly states you may have to add one or two extra lines of code; in your case regex.test(string) may become `regex.match(string).group(1).length > 0` or something along those lines.

The author explicitly states:

> so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command.

But in a programming environment, I will choose to have one more line of code over the extremely hard to read alternative regexes.
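For the record, the "one more line" amounts to filtering on group 1; a Python sketch of the article's Tarzan example:

```python
import re

pattern = re.compile(r'"Tarzan"|(Tarzan)')

def bare_tarzans(text):
    # Group 1 is set only when the unquoted alternate matched;
    # a quoted "Tarzan" is consumed by the first alternate and skipped.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

# bare_tarzans('Jane yelled "Tarzan" at Tarzan') → ['Tarzan'] (only the bare one)
```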


It wasn't really phrased that way, but that was really the core insight of the piece -- if you take a step back and look at the problem from the context of the code running the regex instead of within the regex itself, it's much simpler. Use all the tools at hand to get the job done.


Is that a bug? Certainly feels like one. Try it with software that does not blow goat dicks, and it does not disappoint:

    $ perl -E'say q("Tarzan") =~ /"Tarzan"(*SKIP)(?!)|Tarzan/ '
    $ perl -E'say q(Tarzan)   =~ /"Tarzan"(*SKIP)(?!)|Tarzan/ '
    1

    $ printf '"Tarzan"' | pcre2grep '"Tarzan"(*SKIP)(?!)|Tarzan'
    $ printf 'Tarzan'   | pcre2grep '"Tarzan"(*SKIP)(?!)|Tarzan'
    Tarzan


I dunno, the "logic" solution seems like the obvious one to me; if your boss really has that much trouble with propositional logic that they don't immediately see why it works, well, that's what code comments are for.

(...the trick is still cool, though; I can imagine other situations where it would be more useful. However it does seem like it potentially depends on the particular regex engine being used, in contrast to the author's claim about it being totally portable; yes, it'll compile on anything, but will it work?)


How could it not work? I've regularly relied on order of matching, and never found an environment that didn't test left-to-right for the `|` operator in regex.


> operator in regex.

regex is not regular expressions - if using NFA to match then you're matching all alternates simultaneously.

Russ Cox has good pictures explaining idea in 'Regular Expression Search Algorithms' section of <https://swtch.com/~rsc/regexp/regexp1.html>


I'm talking about regex. Regex libraries in practical use do not use NFA. I'm talking about actual code that's written using normal languages. I'm familiar with the difference between "regex" and "regular expressions" as in "regular languages".


Lex/Flex, which I think we can agree are used by "actual code that's written using normal languages", use DFAs, both inside rules and between rules, and they do not try '|' cases left to right (they probably could have if they wanted, since there is a REJECT action that already forces them to store the list of all the rules/texts that were matched):

    a|ab { cout << "matched ab" << std::endl; }
    b    { cout << "matched b" << std::endl; }

If provided with "ab", this will match the first rule with "ab", and not the first with "a" then the second with "b".


All POSIX compatible regex engines do the same. It's somewhat linked to why POSIX regexes don't have non-greedy operators.

But DFAs can implement the preference-order semantics found in backtracking regex engines too. Russ Cox's articles show how to do that.

(Just adding some additional info to your point.)


Go's regexp package, Rust's regex crate and RE2 are examples of regex engines that are very much in practical use that use NFAs (among other things).


PCRE is a pretty well-defined standard, isn't it? And it's the one used by most of the languages I've worked with, including in MariaDB.


It doesn’t even rely on PCRE, just core regex.


Long page with practical regex advice for programmers, most likely not useful for command line warriors

Lookbehind

Lookahead

Advanced handling of tags

Replace before matching

the best regex trick ever:

"Tarzan"|(Tarzan)

The whole site contains useful regex advice


Very long build-up to what is definitely a neat trick, although without (*SKIP)(*FAIL) it might cause explosive growth in memory usage as it allocates space for the results you don't need (unless you use a streaming regex option).

Speaking of lengthy: this site breaks the iOS Safari scroll bar! It just disappears altogether (even when scrolling up or down to make it show, like you have to these days to please the UX designers in Palo Alto).


The scroll bar works but for some reason it gets rendered very bright. Scroll all the way up to the black background in the header and you’ll see it.


> "Tarzan"|(Tarzan)

OK that's pretty clever (I certainly never thought of putting a capturing group inside only one side of an "or")...

...but it doesn't seem particularly useful? It probably won't work in most cases where this is just part of a larger expression. You're usually using capturing groups in a particular way for a good reason, and this would mess that up.

In contrast, the lookbehind+lookahead way is the "proper" and intuitive way to write it, and works as part of any larger expression.

So... +100 points for cleverness, but don't actually use this please. :)


> In contrast, the lookbehind+lookahead way is the "proper" and intuitive way to write it, and works as part of any larger expression.

I would say, the "proper" way is to have a separate line of code validating what's not there :)


I'm not following?


Not GP, but I'd go a very simple and verbose way, maybe that's what they meant to. Match:

    (.)Tarzan(.)
Then in an additional line of code assert

    not (Group 1 == '"' and Group 2 == '"')
This shifts the logic out of regex and into the surrounding programming language context. That's arguably better, but the resulting regex is extremely dull and unclever.


Don’t forget to look out for matches at the boundaries of the original string. I think it should be something like:

    (^|.)Tarzan(.|$)
Though I’m not 100% sure offhand what the result in the capturing groups would be.
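To answer the capturing-group question: at a string edge the ^ or $ alternate matches and the corresponding group captures the empty string (in Python's re, at least), so the quote check still works. A sketch:

```python
import re

edge = re.compile(r'(^|.)Tarzan(.|$)')

def has_unquoted_tarzan(text):
    # At the edges, ^ / $ match and the groups capture '' rather than a quote.
    return any(
        not (m.group(1) == '"' and m.group(2) == '"')
        for m in edge.finditer(text)
    )

# has_unquoted_tarzan('Tarzan') → True; has_unquoted_tarzan('"Tarzan"') → False
```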


I think dumb, brute force, simple approaches like this are underrated. Writing elegant, pithy code that pleases you aesthetically is nice but writing code that's explicit and obvious and can be maintained by the new kid is often more pragmatic.

Save the clever stuff for where it's needed.


I mean, I guess if nobody on your team understands regexes.

But generally, once you decide to use a regex in the first place, you might as well put as much regular everyday logic as you can in it. Otherwise you might as well look for "Tarzan" with a dumb string search.

Lookbehinds and lookaheads aren't rocket science. And you can always leave a comment about what they're doing if you're worried other team members won't grok the syntax.


> Lookbehinds and lookaheads aren't rocket science.

Lookbehinds and lookaheads (especially negative lookbehinds) are rocket science.

What is "rocket science?" "Rocket science" is the feeling you get in math class where the instructor explains a proof to you in the clearest possible terms and you just don't get it. You have to listen to the explanation multiple times, preferably in a few different ways, and then you have to sleep on it, and then you get it, maybe.

But "rocket science" isn't just hard to understand. It's a hard problem where the consequences for failure are catastrophic. When you fail at rocket science, a multi-million dollar rocket explodes.

Anyone who's ever tried to teach lookbehinds to a newbie has seen it: you explain how lookbehinds work, and then ask the newbie to create a regex with negative lookbehind, to demonstrate mastery. I've done it a few times, and they never get it right, ever.

At best, they flub the syntax, but even once they get over that, they usually write the worst possible regex: a regex that works correctly on desired inputs but does the wrong thing on the input the regex is designed to reject.

This is a notorious problem with writing regexes, but it's way worse for negative lookbehind, because it's asserting that something isn't there, rather than querying for something that is there.

When I see a regex with negative lookbehind during code review, I ask for unit tests, not just comments. Reliably, regexes get even more complex when unit tests are added, because it's just so damn hard to write a correct regex with negative lookbehind.

I've never used the "trick" from TFA before, but it already sounds way easier to use than negative lookbehinds, and I'm curious to try it.
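For what it's worth, those unit tests can be as small as a table of cases next to the regex; the rule below is hypothetical, and the key row is the input the regex must reject:

```python
import re

# Hypothetical rule: match Tarzan unless directly preceded by a double quote.
BARE = re.compile(r'(?<!")Tarzan')

cases = [
    ("Tarzan yelled", True),            # plain occurrence
    ('"Tarzan"', False),                # quoted: must be rejected
    ('say "Tarzan" to Tarzan', True),   # second occurrence is bare
]
for text, expected in cases:
    assert bool(BARE.search(text)) == expected, text
```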


I agree on unit tests for non-trivial regexes as a general rule, but respectfully disagree on lookaheads and lookbehinds.

Things like greedy vs. non-greedy matching, matching newlines or not, handling Unicode correctly, inserting a capturing group when you actually needed a non-capturing group, making sure your regex works if it matches the start or end of a string, escaping characters -- those can be tricky.

On the other hand, lookaheads and lookbehinds are conceptually extremely straightforward, you just need a cheatsheet to remember the syntax is all.


Ha. Of all the things I learned at university, rocket science was the easiest to get. Quantum mechanics on the other hand sucked.


> Otherwise you might as well look for "Tarzan" with a dumb string search.

Yes, this was sort of the idea as well (also see sibling response). I'd just as soon have 2 lines of code rather than a regex.


> I mean, I guess if nobody on your team understands regexes.

If anybody on your team doesn't understand regexes, you mean.


Yeah, that's more or less what I meant. Write a regex (plus line of code) to make sure `Tarzan` appears. Then write another regex and line of code to make sure `"Tarzan"` doesn't appear.

Maybe at this point you aren't using regex even. Nice, you solved two problems.

(I do appreciate regex and even use them a lot. But, I use them enough to avoid them as much as possible.)


Part of me reads these things and I'm like "neat trick", but most of the time they more-or-less prove to me that Regex is doomed to a steady and slow decline.

It's just not a particularly good "interface" for the task it is intended to achieve. A little more ability to be "verbose", at the possible price of succinctness, would I think go a long way. I'm more-or-less waiting for the "blank" in: "blank" is to Python what Regex is to Perl.


I dream that we will have something like Copilot but exclusively for regex and working marvelously

"Find every 2nd instance of a dollar amount that is not encased in quotes" outputting <insert regex here> would be awesome


Please no. If you can’t understand the code, how can you possibly verify that what copilot or similar has produced is correct?


The same way I verify that my own regex is correct: by running a few test cases and then crossing my fingers.


Just like programming tricks in any language supported by copilot:

a. Code, once provided, can be broken down and understood at a far easier level than is required for composition;

b. Worst case, try several test cases to both increase comprehension and reduce the chance of 'gotcha's.

Shouldn't be too hard to stick with option 'a' as clear best practice, looking up any operators or syntax that aren't immediately obvious, the advantage being that the AI can use obscure tricks that you aren't initially aware of but you still have the opportunity to review and understand the regex, becoming better over time. It's theoretically auto-generated, but practically computer-assisted.


That is the opposite of what most people will say. To most, reading someone else’s code is much harder than constructing their own. Try this one:

    @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&& close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print


Smartass answers only. Ask autopilot to write a proof with --nojargon so juniors will get it! Write a single unit test and call it good! Step through it in a debugger on that one unit test to be sure? Sure of what? I dunno but it sounds diligent...

When I watched Idiocracy, a small optimist in me said "but surely the techies..." That optimist has died. We're fucked


Probably the way most people do - they run it over whatever examples they can think of at the moment as a check, and then forget about it till it breaks.


> I'm more-or-less waiting for the "blank" in: "blank" is to Python what Regex is to Perl.

This will sound like a forced joke but I genuinely didn't understand your phrase. I got stuck re-reading several times the "blank" in: "blank" part, but my mental language regex wasn't matching the expression.

I think the bug is caused by a bogus quote that causes a bad parameter expansion. My regex engine parses this better: the "blank" in: "blank is to Python what Regex is to Perl"

Off by one errors...


They're waiting for the X in "X : Python :: regexp : perl" - does this help?


Haha, I love your explanation of this! Human communication is difficult, and using syntax incorrectly makes it even more so.


> I'm more-or-less waiting for the "blank" in: "blank" is to Python what Regex is to Perl.

Parser combinators


Regex reminds me of sendmail.cf. Very clever and powerful, but no one writes configuration files that way any more.


I have to say I agree. Regex is arcane in the most bug-prone way.


This is a great trick. It says something about RegEx syntax that matching a simple rule with a relatively clear expression is a major accomplishment.


Yup. Regex is not a silver bullet for "match stuff", and it is the wrong(ish) tool for the following jobs:

- context sensitive matching

- matching with multi-char-exclusions

(regex is happiest when it's used to match "regular language" things)


I know you said wrong(ish), but for multi-char-exclusion matching, can't you just do [^chars]?

So to search for words without any vowels just 'grep [^aeiou]'?


It's semantics, not syntax. It's one of the simplest and oldest search engines in the history of computing; obviously a more complex engine can provide more robust semantics.
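
To make the vowel example concrete (a sketch; the lookahead pattern is just one way to get word-level exclusion):

```python
import re

# [^aeiou] matches any ONE non-vowel character, so "banana" still matches
# (the 'b' alone satisfies the class):
assert re.search(r'[^aeiou]', 'banana')

# Excluding a multi-character sequence needs a different construct, e.g.
# a negative lookahead checked at every position:
no_ana = re.compile(r'^(?:(?!ana).)*$')
assert no_ana.match('banner')
assert not no_ana.match('banana')
```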


My neat trick is to not use regex and avoid inflicting a novel of explanation on my coworkers.


Very verbose writing for a very succinct regex.


2600 word lead-up to "Tarzan"|(Tarzan)

This style of writing is just obnoxious.


Regex is great and I love it, but the greatest trick is to know when you need to write a parser instead.


of all the things ever invented in software, regex still amazes me.

It's almost like nature, many simple rules coming together to make extremely clever and fairly complex ideas


It took me over 15 years until I started to willingly use RegExp, but now I can't live without it. It's like the curse of knowledge, once you learn something you'll lose all empathy and assume everyone else knows it too. It still surprises me though, I've had bugs like my regex matching terminal color sequences, messing up the data if it was colored.


It feels like something that was more discovered than invented, something that would exist even if nobody knew of its existence. I get the same feeling when listening to Pharrell Williams' Happy.


As the examples in the article use xml, I just wanted to point out that applying regex to xml has a lot of limitations and should be avoided. See: https://stackoverflow.com/questions/1732348/regex-match-open...


I was thinking about that great answer when I was reading the article. Thanks for sharing it.


The solution...

    not_this|(but_this)
... is interesting. But since it returns the match in a submatch I would say the \K approach is better:

    (?:not_this.*?)*\Kbut_this
Because usually when you try hard to accomplish something with a regex, you do not have the luxury to say "And then please disregard the match and look at the submatch instead".


That doesn't work. `(?:"Tarzan".*?)*\KTarzan` should behave identically without `\K`, and it will match `"Tarzan" "Tarzan"` because the ungreedy quantifier ? still allows backtracking (it just changes the search order). You want the possessive quantifier + instead; `not_this|(but_this)` is equivalent because regexp engines will not look back into an already-matched string.


Interesting. I took the \K solution right from the article without trying it.

Now that I try it, it indeed does not work.

Maybe the author reads this and can look at it.


speaking as an old regexp wizard from before perl5, this is indeed a great trick, have an upvote.

sadly, this trick still requires a code comment to explain. Python example:

  # match tarzan but not "tarzan"
  # see https://news.ycombinator.com/item?id=27774584
  m = re.search(r'"tarzan"|(tarzan)', myvar)
  if m and "tarzan" == m[1]:
     ...
which in practice means it probably deserves a function:

  if re_search_but_exclude(r'tarzan', myvar, '"tarzan"'):
     ...
I don't recommend monkeypatching re, i.e. re.search_but_exclude = ...


If you are comparing the match to the wanted string, it defeats the purpose of the capture group.

  if "tarzan" == re.search(r'"tarzan"|tarzan', myvar)[0]:
      ...
Am I missing something?


Is there a reason you have an r-string for the first arg but not for the third one?


It's a harmless mistake.


A bit off topic, but the commented version was much clearer, than the version with separate function. (full sentences are very good at explaining things)


My biggest grief with regexp is that it is just a compact code disguised as something else. It is relatively common that you want to scan a string but action codes intermixed. There is a way to do that with regexp (Perl (?{...}) etc. or PCRE callouts), but it is always awkward to put a code to a regexp. As a result we typically end up with either a complex code that really should have used a regexp but couldn't, or a contorted regexp barring the understanding. The essay suggests `(*SKIP)(*FAIL)` at the end, which is another evidence that a code and a regexp don't mix well so a regexp-only solution is somehow considered worthy.


A shorthand memory hook to remember "Tarzan"|(Tarzan) is that this is similar to conditional evaluation of boolean expressions. For example, in Python you often do

foo = foo or [ 23, 42 ]

Or more generally:

foo = foo or ConstructSomeFoo()

If foo is None then it gets this default value or the newly constructed object, otherwise it's unchanged. Key here is that what's after "or" is not even evaluated if the first operand is already evaluating to True.

So, the left "Tarzan" eats up the matching substring that we do not want, while the right (Tarzan) matches what we do want, but only if the left one didn't already hit.


Needless to say, all of this only works if the regex engine satisfies certain assumptions, i.e., order of "evaluation" is guaranteed.

In abstract semantics of regular expressions, a|b and b|a are equivalent.


It doesn't rely on order of operations, at least in this example?

"Tarzan" will match one character earlier than Tarzan as the string is scanned, so it would be discarded even if you flipped the order of the alternation.

This isn't true of examples where the good and bad matches can start at the same character.
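
A quick check in Python's re bears this out (sketch):

```python
import re

text = '"Tarzan" and Tarzan'
# Flipping the alternation still skips the quoted one, because at the
# position of the opening quote only the quoted branch can match:
for pat in (r'"Tarzan"|(Tarzan)', r'(Tarzan)|"Tarzan"'):
    hits = [m.group(1) for m in re.finditer(pat, text) if m.group(1)]
    assert hits == ['Tarzan']
```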


That the input is matched position by position to alternative patterns and the first one is returned is another one of said assumptions.


I consider myself a decent regex-er, but, despite several attempts over the years (admittedly all in moments when I had an urgent problem to solve), I still don't get lookbehinds / lookaheads, and end up finding some way to do without them.

Nice to see some examples of how ugly lookbehinds / lookaheads can be. And nice to have a new trick for avoiding them!

Although personally, I still think the most pragmatic solution in this case is usually to just filter out "Tarzan" values somewhere other than in regex.


Look(ahead|behind)s do not consume any of the string which is probably where they confuse people.

I see non-look regexes as stepping over each of the characters where you can never go back in time - once a char is stepped over it is gone.

“Looks” allow you to step over characters to true|false match them, then step back in the string as if the look did not exist.


One past thread:

The Greatest Regex Trick Ever (2014) - https://news.ycombinator.com/item?id=10282121 - Sept 2015 (131 comments)


vim.

This one bugs me, because it's a cool enough trick, and I want to expand my thought process when it comes to regex.

The closest I can get visually in vim:

  /"tarzan"\zs\|tarzan
(The \zs flag starts the cursor and highlighting at a given location inside of a larger regex match. I didn't use a capture group here because it didn't help.)

Two problems:

1. This will still match the quoted word when pressing "n" but mostly unhighlights it. (see next point)

2. Whatever single character is after the unwanted match is highlighted, so this would only help for visually searching for reasonably long expressions.

-----

An alternative that I would use unless a special edge case was present (and this is basically the dumb version of the author's typical solutions):

  /tarzan\ze[^"]
(\ze ends the match but continues to filter whatever follows)

In a persistent edge case, I'd probably resort to macros or temporary replacement of the unwanted term. But that's not very satisfying, is it?

-----

More details:

Capture group references evidently work in vim's search mode. I hadn't tried until now. I only see utility in a few cases e.g. finding any duplicate word. The specific case given at the link does not work as-is. I'd need a way of evaluating the author's full expression and then only match the capture group. Is there a way to put the capture group \1 outside of the alternation?

There's possibly a way to use back-referencing or global search and execution or branches. The solution is also probably very clever and concise! I've tried a few permutations and am still stumped.

-----

Last best attempt:

  /"\@<!tarzan"\@<!
(\@<! will match if the previous atom---in this case, double quotes---is not present.)

An edge case where this falls apart? Single leading double quote e.g. "tarzan

Is there a better way?


I used to just go ask Friedl.


That's a Kleene solution.


Bravo


Please make sure to use the level of regex that is standard on your team and be very selective about going beyond that.

Always over-document your regexes and assume people only have very basic regex skills.


And use the whitespace "trick" (usually via the /x modifier).

This:

    HEADER_PARAM =
    /\s*[\w.]+=(?:[\w.]+|"(?:[^"\\]|\\.)*")?\s*/
is not as useful as this:

    /
      \s*           # Maybe whitespace at the beginning
      [\w.]+        # Header key
      =             # Equals (yes:)
      (?:           # Header value from here
        [\w.]+      # Just about anything
          |         # or
        "           # it could be wrapped in quotes
          (?:
            [^"\\]  # Not quotes or backslashes
              |
            \\.     # Sometimes escaped characters occur
          )*        # Could even be an empty string
        "
      )?            # Maybe they didn't supply a value
      \s*
    /x
If you can use interpolation with your regexes you can extend this idea into (largely) self-documenting regexes.


Yes, this should be the best practice.


A funny thing is that I read that post years ago

And this week I randomly thought about it.

I also thought I should have bookmarked it, because now I do not know where to find it again


Since this was written, variable-width look-behinds were added to JavaScript. You're welcome.


So, I realize that things get more complex when you start extending the length of your "context" (though I will argue that in a lot of these cases the result is wrong anyway, so attempts to make it less wrong are weird: you can't, for example, match XML with a regex, so if you are doing that at all I'm expecting you are at a command line trying to do some quick grep and sed filtering, in which case there are some really really easy solutions at hand that are going to be fine), but... this article starts with the premise that the default solution to this is somehow lookaround. But lookbehind in particular is a feature you aren't given often enough that it is worth avoiding, and it is easy to do so in this case: '((?!"Tarzan").|^)Tarzan'.

Though like, "real talk": does whatever random tool you are using even have lookahead? The reality is that lookaround is convenient, and I've totally gotten lost in its intricacies a long time ago, but this is such a simple case that it seems worthwhile to appreciate how it works and how the underlying expressions manifest... and then like, I can appreciate that the "trick" the author is advocating for is easily extensible to multiple "contexts" (which I keep putting in quotes, as if we are honest about this being "context" then you can't solve it with a true "regular expression"... we are kind of half-assing this by not realizing that the middle Tarzan in '"Tarzan"Tarzan"Tarzan"' is not actually enclosed in quotes), but it is even less useful than the lookaround variants (which are at least supported by grep -P)... how am I supposed to pass the article's recommendation to grep, much less grep -l?

If we are somehow required to solve this problem with a single regular expression (which is totally an acceptable limitation, as that's what makes using grep -l complex: doing multi-stage filtering is annoying), my recommendation thereby would be to simply do: '([^"]|^)Tarzan|"Tarzan([^"]|$)'. This doesn't require any fancy features, and I think is thereby much easier to explain than anything you can find in this article (including the author's "trick").

If you don't have to do it in a single pass, then do it in two: grep 'Tarzan' | grep -v '"Tarzan"' (which is sloppy, but again: you can't actually do this task using regular expressions anyway, so sloppy is fine: look at the result and verify it makes sense... under no circumstances should you code stuff like this for automated use in production, though, which is then scary as the author's "trick" is really only applicable to sitting around in a heavier language, meaning they might not understand that this is all flawed by definition).


Dumb question, this wouldn't work for global matching, right?

/"Tarzan"|(Tarzan)/g


The website is so nice, it reminds me of the better times of the web


It is unreadable for me. This is how it looks in firefox: https://ibb.co/TP5WjDY


Anyone know what the lua patterns equivalent is to this trick?


.replace('"tarzan"','12345678')
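
Spelled out, that replace-first approach looks something like (sketch):

```python
import re

text = '"Tarzan", said Jane; Tarzan turned.'
# Overwrite the quoted form with same-length filler, then search freely:
scrubbed = text.replace('"Tarzan"', '12345678')
assert '"Tarzan"' not in scrubbed
assert len(list(re.finditer(r'Tarzan', scrubbed))) == 1
```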


Is anyone having trouble reading the page? It renders as dark gray on slightly darker green and is illegible.


Yes it is a disaster: https://ibb.co/TP5WjDY


Weird, FF on one of my boxes shows ^ that mess, while the same FF version on another box shows it nicely rendered: https://ibb.co/z8QqGj2

Probably one of my privacy plugins blocking something but I'm not going to debug someone else's page today.


yes. outline.com fixed it nicely

https://outline.com/YSYgsp


The greatest regex trick ever is knowing when not to use one.


The greatest regex /skill/ is knowing that a regex cannot describe everything.


I've seen several regexs in various code reviews that are used to validate user input but do so in an exponential manner that can be exploited for simple DOS attacks.


Ooooh or worse, I once caught someone's "email matching" RegEx code during a code review that was opening the door for some nasty SQL Injection or XSS attacks (kind of like validating if the text field contained a valid email.. but not if it was ONLY a valid email).

The problem with RegEx is its "obscurity". However, maybe someone could write a nice testing tool that would throw millions of known exploits into each regex it finds in your code to see if it is vulnerable.
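
One way that hole opens up: an unanchored pattern used as a validator (hypothetical sketch; the pattern and payload here are made up):

```python
import re

# A "contains an email" check is not an "is an email" check:
loose = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
evil = "a@b.com'; DROP TABLE users;--"

assert loose.search(evil)            # looks "valid" to the loose check
# Anchoring via fullmatch rejects anything beyond the address itself:
assert loose.fullmatch(evil) is None
assert loose.fullmatch('a@b.com')
```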


Like what? I've never thought about what regex features are exponential.


It's more a question of which ones can't be. There are some really nasty and not very obvious gotchas here; https://regular-expressions.mobi/catastrophic.html has a good dive into how, for example, backtracking combines with incautious regex design to produce exponential behavior in the length of input.
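
The canonical toy example, in Python for concreteness:

```python
import re

# Classic catastrophic-backtracking shape: nested quantifiers over the
# same character, anchored so that a near-miss forces exhaustive retries.
pattern = re.compile(r'^(a+)+b$')

assert pattern.match('a' * 10 + 'b')   # succeeds quickly
# On failure the engine tries every way to split the a's between the two
# quantifiers -- exponential in n. Kept tiny here on purpose:
assert pattern.match('a' * 10) is None
```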

I don't have a hard and fast rule of my own about regex complexity, but I do have a strong intuition over what's now ca. 25 years of working with regexes dating back to initial exposure in Perl 5 as a high schooler. That intuition boils down more or less to the idea that, when a regex grows too complex to comprehend at a glance, it's time to start thinking hard about replacing it with a proper parser, especially if it's operating over (as yet) imperfectly sanitized user input.

Sure, it's maybe a little more work up front, at least until you get good at writing small fast parsers - which doesn't take long, in my experience at least; formal training might make it easier still, but I've rarely felt the lack. In exchange for that small investment, you gain reliability and maintainability benefits throughout the lifetime of the code. Much of that comes from the simple source of no longer having to re-comprehend the hairball of punctuation that is any complex regex, before being able to modify it at all - something at which I was actually really good, as recently as a decade or so ago. The expertise has since expired through disuse, and that's given me no cause for regret; the thing about being a regex expert is that it's a really good skill for writing unreadable and subtly dangerous code, and not a skill good for much of anything else. Unreadable and subtly dangerous code was fine when I was a kid doing my own solo projects for fun, where the worst that'd happen is I might have to hit ^C. As an engineer on a team of engineers building software for production, it's not even something I would want to be good at doing.


> That intuition boils down more or less to the idea that, when a regex grows too complex to comprehend at a glance, it's time to start thinking hard about replacing it with a proper parser

You can get some surprisingly complex yet readable regexes in Perl by using qr//x[1] and decomposing the pieces into smaller qr//s that are then interpolated into the final pattern, along with proper inline comments in the regexes themselves.

[1] https://perldoc.perl.org/perlre#/x-and-/xx
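
A rough Python analogue of the qr//x decomposition, for the non-Perl readers (a sketch; the names are made up):

```python
import re

# Name the sub-patterns, interpolate them into a VERBOSE pattern,
# and comment each piece:
QUOTED = r'" (?: [^"\\] | \\. )* "'
BARE   = r'\w+'
TOKEN  = re.compile(rf'''
    {QUOTED}          # a whole quoted string, matched but not captured
  | ( {BARE} )        # or a bare word, captured
''', re.VERBOSE)

words = [m.group(1) for m in TOKEN.finditer('"Tarzan" met Tarzan') if m.group(1)]
assert words == ['met', 'Tarzan']
```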


You still have to reason about the whole thing, though. This doesn't make that any easier, but I bet it makes it feel easier.


Decomposition is a proven method for making complex code both feel and actually be easier to reason about.

Regexes are code.

Therefore, decomposition makes complex regexes both feel and actually be easier to reason about.


I don't see anything about qr//x that makes regexes built this way less vulnerable to the kind of exponential backtracking problem under discussion here.

I do see a great opportunity to, by assuming interpolated qr// substrings have the locality the syntax falsely suggests, inadvertently create exactly that kind of mishap with it being minimally no easier, and potentially actually more difficult, to notice.

Write your code however you like, of course, including concatenating strings and passing the result to 'eval'. The last time I dealt with more Perl than a shell one-liner was around 2012, and that the language encourages this kind of thing is one of the reasons I'm glad of that.


Given that I write my code with a text editor that does nothing but concatenate strings that I input and then I pass it to a compiler or an interpreter, all of the code I write is concatenating strings and passing it to 'eval'.

And I use proper decomposition to keep it cognitively manageable. It's pretty clear that reasoning about composition is beyond you, but trust me that given two procedures that both do not have an undesirable property, one can rest assured that simple composition will not introduce that undesirable property.


Many things are beyond me. Perhaps it's to my good fortune that the generally low utility of gratuitous personal insults is not among them. Certainly the next technical discussion I see improved by such behavior will be the first.


Well then, in the interest of amity let me suggest that it would be to your good fortune to work on your self-awareness. But, should you prefer not to, then by all means, you do you.



it's nice. I'm way more dumbfounded by the prime thing though


Me too. I had to look it up. This page has pretty good breakdown:

https://itnext.io/a-wild-way-to-check-if-a-number-is-prime-u...

The main trick for me was you first have to convert the number to unary, which was done outside of the regex.


> The Greatest Regex Trick Ever

was to convince programmers it didn't exist?


the best regex trick is not to use regex!


For me, the site rendered dark gray text on a dark gray background and is a chore to read as-is. Outline.com fixed my issue with it: https://outline.com/YSYgsp


"Please don't complain about website formatting, back-button breakage, and similar annoyances. They're too common to be interesting. Exception: when the author is present. Then friendly feedback might be helpful."

(It's not that the annoyances aren't annoying, it's that they're so common that they lead to repetitive offtopicness that compounds into more boring threads.)

https://news.ycombinator.com/newsguidelines.html


I got curious and looked back in archive.org to this page's initial release in 2014. The text background started out as good old reliable background-color: #EEEEEE, which was later replaced with background: url("http://a.yu8.us/bg-tile-parch.gif")

...because what could possibly go wrong? From the latest comment at the end of the page, the author would like you to know that the outcome is your problem, because you're using the wrong browser:

June 20, 2021 - 15:02

Subject: RE: Undoing whatever is hiding this page.

Hi Allen, try a different browser. There's no strange shading on the page, your browser is deciding to display it in a weird way. Regards, -Rex


Most likely using the HTTPS Everywhere addon. That website is not available via HTTP, and the user must visit the page first to accept the 'risk' of using the http version.


Firefox also defaults to HTTPS nowadays. Lots of content blockers block third party content too. Regardless, if literally anything goes wrong with the third party dependency that the article's contrast depends on, the best case scenario here is that the text falls back on the body's background.

Interestingly, the author also appears to control yu8.us

Breaking one's own content by https-ing one site but not another is a great example of why to not prop up a website's basic legibility on a third party dependency, even if it's one you own and control.


Yes, the web author made the mistake of defining the <article> background-color: #EEEEEE within a min-width 960px media query. If the background image fails to load in a wider window, there's still a readable contrast between text and background, but on a phone or other narrow screen, the dark background color set on the <body> is what's behind the article text.


It's definitely nothing to do with the following string in the response:

> Page copy protected against web site content infringement by Copyscape


firefox shows it as black(ish) text on a light yellow background. I think you must be blocking something


Clicking on a http:// link these days feels like I have been tricked into clicking on a phishing link in an email.

Good trick though.


This is why any attempts to make plain http sites throw up scare warnings is a horrible idea. The internet is littered with old websites that contain a wealth of knowledge and deserve to remain accessible.

Just make browsers go into “read only” mode where input cannot be accepted on non-secure pages. But don’t wall them out!


It would be nice if HN marked such links before I click on them (especially on mobile I can't see the link before I click on it easily). @dang?


Please don't


Please don't use regular expressions to parse Dyck languages. It doesn't work.


Regexp for tokenization does work. This entire essay boils down to the fact that you can always postprocess matches and in this case that corresponds to tossing unwanted tokens out.
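
A minimal sketch of that tokenize-then-filter idea in Python:

```python
import re

# Match quoted strings OR bare words in one non-overlapping pass:
token = re.compile(r'"[^"]*"|\w+')

def bare_words(text):
    # Quoted tokens are consumed (so their insides can't match again),
    # then tossed out in postprocessing.
    return [t for t in token.findall(text) if not t.startswith('"')]

assert bare_words('"Tarzan", said Jane; Tarzan turned.') == ['said', 'Jane', 'Tarzan', 'turned']
```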


Yes, tokenization is regular.

Parsing the tokenization result of a Dyck language still requires as context-free grammar.

It's not a badge of honor or a great trick to try that with regular expressions. It is using the wrong tool for the job.


This site reminded me the times when I interviewed candidates. One of the interview problems was to write a function that would validate if a given string was a valid IPv4 address (a la 10.10.10.1).

Some of the candidates started by saying: "I know! I'll use a Regular Expression", to what I replied: "Great!, now you have TWO problems!"


this should do it:

    ^((1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|[1-9])\.){3}(1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|[1-9])$
Was the second problem "the interviewer"?


Doesn't match 10.0.0.55

No hire :-)
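
For what it's worth, a variant that also accepts 0 octets (a sketch; assumes leading zeros should be rejected):

```python
import re

# Per-octet: 250-255 | 200-249 | 100-199 | 0-99 without leading zeros
octet = r'(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
ipv4 = re.compile(rf'^{octet}(\.{octet}){{3}}$')

assert ipv4.match('10.0.0.55')
assert ipv4.match('255.255.255.255')
assert not ipv4.match('256.1.1.1')
```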


Fun fact:

2134567890 is a valid IPv4 address

You can try and ping it


This is simply not absolutely correct, only conditionally so.

The only way in which

  "Tarzan"|(Tarzan)
extracts a Tarzan that is not in quotes is when it is used for scanning the input for non-overlapping matches.

(We know from lexical analysis with regexes, a form of non-overlapping extraction, that the "Tarzan" token is different from a Tarzan token. An identifier won't be recognized if it is in the middle of a string literal.)

It's not the regex itself, but a particular way of using it.

If the regex is used for finding all maximally long matching substrings, then it won't work. It will find "Tarzan" and it will find the Tarzan also within those quotes.

Notably, the regex will also fail if it is used to find a single match, like the leftmost. If the datum is a string like

   "Tarzan", said Jane; Tarzan turned.
then the leftmost "Tarzan" will be found, and that's it. The regex will not find the leftmost Tarzan that is not wrapped in quotes.

We cannot even use this to simply grep files for lines that have Tarzan that is not in quotes.
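
A quick Python check of both claims (sketch):

```python
import re

text = '"Tarzan", said Jane; Tarzan turned.'
pat = re.compile(r'"Tarzan"|(Tarzan)')

# A single leftmost-match search stops at the quoted one, group 1 empty:
m = pat.search(text)
assert m.group(0) == '"Tarzan"' and m.group(1) is None

# Scanning for ALL non-overlapping matches is what makes the trick work:
hits = [m.group(1) for m in pat.finditer(text) if m.group(1)]
assert hits == ['Tarzan']
```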


regex101.com says it works, by returning multiple matches, only one of which has a group 1.

But I don't know what environments return multiple matches from one evaluation.


The matches must be non-overlapping, because Tarzan is contained in "Tarzan". Therefore, the input

  "Tarzan", she said.
contains two matches for the regex

  "Tarzan"|Tarzan
The first match is at character 0, for the "Tarzan" branch of the regex. The second match is at character 1 for the Tarzan branch of the regex.

If matches can be overlapping, then the inner Tarzan is matched in spite of being surrounded in quotes, and the capture register is bound and all.

This works not as a property of the regex (what it matches), but the regex combined with a scanning algorithm that extracts non-overlapping matches.



