The simple things in life elude me. Holy cow. I learned the "tack a grep -v grep on the end" trick and never looked into refining it. But just flipping the greps? Nope, not once did that ever occur to me. Thanks.
That's a valid concern, but in this case it won't cause problems -- grep exits with 0 if a line was selected; in the case of `grep -v grep` that means there was a line without "grep", which is what we want.
(Also @thewakalix made a good suggestion to reverse the greps.)
pgrep is great, but note that you can still encounter this problem if you run pgrep in parallel -- it will never match its own process, but it will match other pgrep processes.
So for example if you have a script that uses `pgrep -f banana` to search for a "banana" process, and you run that script twice in parallel, pgrep might see the other pgrep process and think "banana" is running even though it isn't.
What systems don’t have grep? Off the top of my head, anything vaguely POSIX compliant would have it, and anything descended from 4th Edition Unix would also have it, which seems like it would cover everything.
Toolchains for compiling systems from source are another answer, e.g., NetBSD's toolchain has sed, but not grep.
A third answer is install media. For example, NetBSD install kernels have ramdisks with sed but not grep.
A fourth answer is personal, customised systems. I create small systems that run from RAM. I run these on small computers with limited resources. When one of these computers first boots up, it may not have a "full" set of userland programs. It may not have grep. I am not inclined to use the limited space available to include grep at such an early stage if I can get by with sed.
Maybe someone deleted grep by accident? Or maybe you accidentally traveled in time to the 70's and need to use a nearby PDP-11 to calculate a way home? Lots of possibilities.
The regex `[b]anana` matches the string "banana". But it does not match the literal string `[b]anana` --- to match that, you'd need something like `\[b\]anana`. The literal string `[b]anana` is what shows up in the process table for grep, and so it doesn't match.
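A quick Python sketch of why (the process-table line here is a made-up illustration):

import re
pattern = r"[b]anana"                        # the class [b] matches only a literal 'b'
print(re.search(pattern, "banana"))          # matches: the real process line contains 'banana'
print(re.search(pattern, "grep [b]anana"))   # None: the grep line contains '[b]anana', not 'banana'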
The more general tip is that a single regex isn't the only tool you have. You don't have to get your final product in one step. Almost every "disaster" regex comes from someone trying to do too much at once.
One other solution would have been to run the regex twice: once to pick up all instances of Tarzan, and a second time on the results of the first to filter out all instances of "Tarzan".
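In Python, that two-pass approach might look something like this (a sketch; the sample text is made up):

import re
text = 'Jane shouted, "Tarzan" just as Tarzan arrived.'
hits = list(re.finditer(r'Tarzan', text))                   # pass 1: every instance
quoted = [m.span() for m in re.finditer(r'"Tarzan"', text)] # pass 2: the quoted ones
keep = [m for m in hits
        if not any(q0 < m.start() and m.end() < q1 for q0, q1 in quoted)]
print([m.start() for m in keep])                            # -> [31]: only the unquoted Tarzan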
This is also often the problem with disaster SQL queries. I've seen some monsters that got hopelessly tangled up in their own JOIN constraints, trying to fetch all the data in one roundtrip because OMG LATENCY but then having to do full table scans over large-ish tables instead. Rewriting it as three small indexed queries reduced the runtime from 40 minutes (!) to less than a second.
Don't do too much in one operation whether it's regexes, SQL queries or OOP classes!
I’ve used the pattern several times of “select these SQL objects into a cursor, then iterate over the cursor to assign/revoke/check permissions on the objects”.
It’s still in a single batch of SQL (stored procedure in our case, so no additional network roundtrips), but the code is vastly clearer to read/maintain this way.
They're clearer in the sense that they make it easier/possible to do things like:
While maintaining/changing the SQL, comment in/out select-statements-as-printf-debugging, and comment in/out actual execution of the statements themselves.
These cursors would often contain [identifying object reference], [category of statement], [text of SQL statement to execute]. You would write a select statement to populate the cursor, then loop over the cursor to run all the statements in the order you wanted (drops, then user/role creates, then grants, or whatever the situation called for).
It's not about logical clarity, but practical maintainability given the (overall weak) state of tooling for database queries. Is it a bastardization of SQL to do something that "should be" done in another scripting language? Maybe, but there's a lot of power in giving the DBAs tooling that works exclusively in a language and environment that's familiar for them rather than splitting it across SQL and python/tcl/ruby/whatever. Not nearly every competent [relational] DBA is competent across multiple languages. Every competent [relational] DBA is competent in SQL.
Is it even possible to use set-based SQL to call EXEC SQL EXECUTE IMMEDIATE or sp_executesql on each statement in a set?
I’m pretty decent with regex, but I often break complex regexes down into multiple steps for better clarity and easier debugging. Sure, you can use extremely clever one-liners, but the next maintainer of your code may hunt you down and murder you on the spot for wasting weeks of their time.
To your point (and how to fix it in a way that seemingly nobody does): you can make complicated regular expressions pretty simple by using named groups and ignoring pattern whitespace, because they allow you to logically separate different components and specify intent. Nobody would ask a fellow dev to debug JavaScript where it is all on a single line and every variable name is a1, a2, etc. Except people do it all the time with regex. It's insane. Hell, you don't even get the blessing of a1, a2, a3. It is all unnamed. Insanity.
And, as always, the "next maintainer" is most likely your future self. So it's really nice to give your future self the gift of clarity, and conversely to look back and say "Thanks, past me!"
A big source of trying to do too much is environments that offer easy regex-based transformations defined as a pair of a regex and a single replacement string (which may contain references to matching groups), while making other transformations hard (a "while find" loop plus the rest by hand). When you have the option to provide a "process match" closure instead of the replacement string, the lure of putting too much into a single regex almost collapses.
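Python's re.sub is one environment that accepts such a closure in place of the replacement string; a sketch (the uppercasing is just for illustration):

import re

def on_match(m):
    # the decision lives in code, not in the pattern
    return m.group(0).upper() if m.group(1) else m.group(0)

print(re.sub(r'"Tarzan"|(Tarzan)', on_match, 'Jane said "Tarzan" to Tarzan'))
# -> Jane said "Tarzan" to TARZAN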
Caveats apply. A regular expression isn't just a way of saving yourself a few lines of explicit string manipulation. It's describing a state machine that does these text operations efficiently (in some programming environments, that state machine will get optimized and compiled down to metal prior to first use).
If you're matching a couple short strings, sure, don't bother overthinking the regex. If you're matching a lot of them, and/or they're long, then the extra time spent on making a single regex work will be worth it. The regex will work smarter than your hand-rolled code, and it also won't waste memory returning partial results.
Also: in my experience, almost every "disaster" regex comes from people not bothering to document and test what they write.
I got the feeling that a lot of those »I have to do this in a single regex« questions come from places where a single regex is basically the only API you have available. Something like form input validation where the framework provides a handy regex it uses for validation, but doesn't expose actual validation callbacks or events to do the same in code without having to redo everything around it. It's only a hunch, but when I have the opportunity to use code to validate a string I probably wouldn't assume that code has to be a mandatory one-liner, even as a beginner developer.
This trick may be thought of as a simplification of the systematic approach to parsing stuff, that is, the lexer-parser division of responsibilities.
The lexer uses regexes but only for splitting the input stream of characters into tokens. Identifiers, integers, operators, strings, keywords, opening brackets and whatnot - each type of token is defined by a regex. This part is hopefully deterministic and simple, although the lexer matches regexes for all kinds of tokens at once, which is why lexer generators are often used to generate lexers.
The heavy lifting is done by the actual parser, which tries to combine the tokens into something that makes sense from the point of view of the grammar.
So in this trick the sub-regexes between |'s define the tokens (the lexer part) while the group mechanism selects the single token that we want to keep (a very very simple parser).
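A minimal Python sketch of that reading, where the alternation is the "lexer" and the group test is the "parser":

import re
text = '"Tarzan" yelled; Tarzan swung'
tokens = re.finditer(r'"Tarzan"|(Tarzan)', text)    # two token types, one tagged with a group
print([m.group(1) for m in tokens if m.group(1)])   # -> ['Tarzan']: keep only the wanted token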
The (?…) syntax group has to be the most unmemorable of the bunch. I've used it maybe 1,000 times and I still have to look up ?: or ?! or ?< or whatever else.
I used to have a laminated sheet on my wall at an office because it was so terribly bad.
I'm not sure if any regex library exposes this, but since regular languages are closed under complement and intersection, you could theoretically do something like match("....string..", regex("Tarzan") - regex("\"Tarzan\"")), where the - operation is shorthand for intersection with the complement. Does anyone know if any regex libraries expose these sorts of operations on the regular expression/underlying DFA?
Intersection with the complement will not work here.
Because the idea is to match Tarzan, but only if it is not preceded and followed by a quote.
Regex intersection and complement do not perform look-behind or trailing context.
Live demo:
This is the TXR Lisp interactive listener of TXR 265.
Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
TXR may be used in areas that are not necessarily well ventilated.
1> [#/Tarzan&~"Tarzan"/ "Jane shouted, \"Tarzan\""]
"Tarzan"
2> [#/Tarzan&~"Tarzan"/ "Jane shouted, \"Tarzan!\""]
"Tarzan"
The &~"Tarzan" makes absolutely no difference. The reason is that Tarzan matches exactly one string. The complement ~"Tarzan" matches a whole countable infinity of strings, and one of those is Tarzan. The intersection of that infinity and Tarzan is therefore Tarzan.
Intersection with complement is useful like this:
Search for a three-character substring that is not cat:
Wouldn't that end up just being the same as 'regex(Tarzan)'? Those regexes can't match the same thing, they can only overlap.
What you want is something like all matches of regex("Tarzan") not contained in a match for regex("\"Tarzan\""), which is a bit trickier. That would require something like:
Unfortunately (or perhaps fortunately), “regexes” as commonly implemented in programming languages are only loosely related to regular expressions from automata theory. With all their extensions, they can recognize much, much more than just regular languages, and I don’t think they’re closed under complement (though I’m not sure). However, most regex engines have a feature called negative lookahead assertions, (?!do not match), which would almost work in the way you suggest.
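For instance, combining a negative lookbehind with a negative lookahead in Python (a sketch, with the usual caveats about adjacent quotes):

import re
pattern = r'(?<!")Tarzan(?!")'   # not preceded and not followed by a double quote
print(re.findall(pattern, 'Jane said "Tarzan" to Tarzan'))   # -> ['Tarzan']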
You have to be careful about inputs like this though: “Inside a string”Tarzan”Again inside a string”
Yeah, a DFA that recognizes a regular language can easily be implemented with O(n) worst case behavior.
My attitude is generally that one should use regexes for matching regular languages and if one needs a stack or even Turing completeness then handle that in code around the regex.
Not exactly that but take a look at https://github.com/mtrencseni/rxe ("literate regex"). I found this on HN and recall the comment thread being good but I can't find it now.
I kinda love that this is written like an infomercial.... for regexes. I've gotten far enough to think, oh, I think I actually just had this problem recently and didn't know how to solve it with regexps, but I'm still not to the part that actually tells me the One Regex Trick, and I'm still reading!
This article is kind of a bait and switch actually. It first states:
>we want to match Tarzan except when this exact word is in double-quotes
and so the reader might start thinking of ways to "match" this. The author then starts to mention ways to do this, but in the end their trick is actually not to "match" it, but to remember it in a group. It will not match what the author says it will match: if you do regex.test(string), it returns true whenever "Tarzan" appears, because that is one arm of the alternation.
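You can see that in two lines of Python:

import re
m = re.search(r'"Tarzan"|(Tarzan)', '"Tarzan"')   # the overall regex matches...
print(bool(m), m.group(1))                        # -> True None: truthy, even though the group is empty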
It appears the author is very good at story-telling, though.
In the very first section (albeit after the introduction, but still before the Tarzan example you show), it clearly states the limitations:
> Before we proceed, I should point out some limitations of the technique:
The author clearly states you may have to add one or two extra lines of code; in your case regex.test(string) may become something like `regex.match(string).group(1).length > 0`.
The author explicitly states:
> so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command.
But in a programming environment, I will choose to have one more line of code over the extremely hard to read alternative regexes.
It wasn't really phrased that way, but that was really the core insight of the piece -- if you take a step back and look at the problem from the context of the code running the regex instead of within the regex itself, it's much simpler. Use all the tools at hand to get the job done.
I dunno, the "logic" solution seems like the obvious one to me; if your boss really has that much trouble with propositional logic that they don't immediately see why it works, well, that's what code comments are for.
(...the trick is still cool, though; I can imagine other situations where it would be more useful. However it does seem like it potentially depends on the particular regex engine being used, in contrast to the author's claim about it being totally portable; yes, it'll compile on anything, but will it work?)
How could it not work? I've regularly relied on order of matching, and never found an environment that didn't test left-to-right for the `|` operator in regex.
I'm talking about regex. Regex libraries in practical use do not use NFAs. I'm talking about actual code that's written using normal languages. I'm familiar with the difference between "regex" and "regular expressions" in the "regular languages" sense.
Lex/Flex, which I think we can agree is used by "actual code that's written using normal languages", uses DFAs, both inside rules and between rules, and does not try '|' cases left to right (it probably could have if it wanted, since there is a REJECT action that already forces it to store the list of all the rules/texts that were matched).
Very long build-up to what is definitely a neat trick, although without `(*SKIP)(*FAIL)` it might cause explosive growth in memory usage as it allocates space for the results you don't need (unless you use a streaming regex option).
Speaking of lengthy: this site breaks the iOS Safari scroll bar! It just disappears altogether (even when scrolling up or down to make it show, like you have to these days to please the UX designers in Palo Alto).
OK that's pretty clever (I certainly never thought of putting a capturing group inside only one side of an "or")...
...but it doesn't seem particularly useful? It probably won't work in most cases where this is just part of a larger expression. You're usually using capturing groups in a particular way for a good reason, and this would mess that up.
In contrast, the lookbehind+lookahead way is the "proper" and intuitive way to write it, and works as part of any larger expression.
So... +100 points for cleverness, but don't actually use this please. :)
Not GP, but I'd go a very simple and verbose way; maybe that's what they meant too. Match:
(.)Tarzan(.)
Then, in an additional line of code, assert:
not (Group 1 == '"' and Group 2 == '"')
This shifts the logic out of regex and into the surrounding programming language context. That's arguably better, but the resulting regex is extremely dull and unclever.
I think dumb, brute force, simple approaches like this are underrated. Writing elegant, pithy code that pleases you aesthetically is nice but writing code that's explicit and obvious and can be maintained by the new kid is often more pragmatic.
I mean, I guess if nobody on your team understands regexes.
But generally, once you decide to use a regex in the first place, you might as well put as much regular everyday logic as you can in it. Otherwise you might as well look for "Tarzan" with a dumb string search.
Lookbehinds and lookaheads aren't rocket science. And you can always leave a comment about what they're doing if you're worried other team members won't grok the syntax.
> Lookbehinds and lookaheads aren't rocket science.
Lookbehinds and lookaheads (especially negative lookbehinds) are rocket science.
What is "rocket science?" "Rocket science" is the feeling you get in math class where the instructor explains a proof to you in the clearest possible terms and you just don't get it. You have to listen to the explanation multiple times, preferably in a few different ways, and then you have to sleep on it, and then you get it, maybe.
But "rocket science" isn't just hard to understand. It's a hard problem where the consequences for failure are catastrophic. When you fail at rocket science, a multi-million dollar rocket explodes.
Anyone who's ever tried to teach lookbehinds to a newbie has seen it: you explain how lookbehinds work, and then ask the newbie to create a regex with negative lookbehind, to demonstrate mastery. I've done it a few times, and they never get it right, ever.
At best, they flub the syntax, but even once they get over that, they usually write the worst possible regex: a regex that works correctly on desired inputs but does the wrong thing on the input the regex is designed to reject.
This is a notorious problem with writing regexes, but it's way worse for negative lookbehind, because it's asserting that something isn't there, rather than querying for something that is there.
When I see a regex with negative lookbehind during code review, I ask for unit tests, not just comments. Reliably, regexes get even more complex when unit tests are added, because it's just so damn hard to write a correct regex with negative lookbehind.
I've never used the "trick" from TFA before, but it already sounds way easier to use than negative lookbehinds, and I'm curious to try it.
I agree on unit tests for non-trivial regexes as a general rule, but respectfully disagree on lookaheads and lookbehinds.
Things like greedy vs. non-greedy matching, matching newlines or not, handling Unicode correctly, inserting a capturing group when you actually needed a non-capturing group, making sure your regex works if it matches the start or end of a string, escaping characters -- those can be tricky.
On the other hand, lookaheads and lookbehinds are conceptually extremely straightforward, you just need a cheatsheet to remember the syntax is all.
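For the cheatsheet-minded, the four forms in Python syntax:

import re
re.compile(r'Tarzan(?=")')     # lookahead:           Tarzan followed by a quote
re.compile(r'Tarzan(?!")')     # negative lookahead:  Tarzan not followed by a quote
re.compile(r'(?<=")Tarzan')    # lookbehind:          Tarzan preceded by a quote
re.compile(r'(?<!")Tarzan')    # negative lookbehind: Tarzan not preceded by a quote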
Yeah, that's more or less what I meant. Write a regex (plus line of code) to make sure `Tarzan` appears. Then write another regex and line of code to make sure `"Tarzan"` doesn't appear.
Maybe at this point you aren't even using regex. Nice, you solved two problems.
(I do appreciate regex and even use them a lot. But, I use them enough to avoid them as much as possible.)
Part of me reads these things and I'm like "neat trick", but most of the time they more-or-less prove to me that Regex is doomed to a steady and slow decline.
It's just not a particularly good "interface" for the task it is intended to achieve; a little more ability to be "verbose", at the possible price of succinctness, would I think go a long way. I'm more-or-less waiting for the "blank" in: "blank" is to Python what Regex is to Perl.
Just like programming tricks in any language supported by copilot:
a. Code, once provided, can be broken down and understood at a far easier level than is required for composition;
b. Worst case, try several test cases to both increase comprehension and reduce the chance of 'gotchas'.
Shouldn't be too hard to stick with option 'a' as clear best practice, looking up any operators or syntax that aren't immediately obvious, the advantage being that the AI can use obscure tricks that you aren't initially aware of but you still have the opportunity to review and understand the regex, becoming better over time. It's theoretically auto-generated, but practically computer-assisted.
Smartass answers only. Ask autopilot to write a proof with --nojargon so juniors will get it! Write a single unit test and call it good! Step through it in a debugger on that one unit test to be sure? Sure of what? I dunno but it sounds diligent...
When I watched Idiocracy, a small optimist in me said "but surely the techies..." That optimist has died. We're fucked
Probably the way most people do - they run it over whatever examples they can think of at the moment as a check, and then forget about it till it breaks.
> I'm more-or-less waiting for the "blank" in: "blank" is to Python what Regex is to Perl.
This will sound like a forced joke but I genuinely didn't understand your phrase. I got stuck re-reading several times the "blank" in: "blank" part, but my mental language regex wasn't matching the expression.
I think the bug is caused by a bogus quote that causes a bad parameter expansion. My regex engine parses this better: the "blank" in: "blank is to Python what Regex is to Perl"
It's semantics, not syntax. It's one of the simplest and oldest search engines in the history of computing; obviously a more complex engine can provide more robust semantics.
It took me over 15 years until I started to willingly use RegExp, but now I can't live without it. It's like the curse of knowledge: once you learn something, you lose all empathy and assume everyone else knows it too. It still surprises me though; I've had bugs like my regex matching terminal color sequences and messing up the data if it was colored.
It feels like something that was more discovered than invented, something that would exist even if nobody knew of its existence. I get the same feeling when listening to Pharrell Williams' Happy.
... is interesting. But since it returns the match in a submatch I would say the \K approach is better:
(?:not_this.*?)*\Kbut_this
Because usually when you try hard to accomplish something with a regex, you do not have the luxury to say "And then please disregard the match and look at the submatch instead".
That doesn't work. `(?:"Tarzan".*?)*\KTarzan` should behave identically without `\K`, and it will match `"Tarzan" "Tarzan"` because the ungreedy quantifier ? still allows backtracking (it just changes the search order). You want the possessive quantifier + instead; `not_this|(but_this)` is equivalent because regexp engines will not look back into an already-matched string.
A bit off topic, but
the commented version was much clearer than the version with a separate function.
(full sentences are very good at explaining things)
My biggest grief with regexp is that it is just compact code disguised as something else. It is relatively common that you want to scan a string with action codes intermixed. There is a way to do that with regexp (Perl's (?{...}) etc., or PCRE callouts), but it is always awkward to put code into a regexp. As a result we typically end up with either complex code that really should have used a regexp but couldn't, or a contorted regexp that bars understanding. The essay suggests `(*SKIP)(*FAIL)` at the end, which is more evidence that code and regexps don't mix well, so a regexp-only solution is somehow considered worthy.
A shorthand memory hook to remember "Tarzan"|(Tarzan) is that this is similar to conditional evaluation of boolean expressions. For example, in Python you often do
foo = foo or [ 23, 42 ]
Or more generally:
foo = foo or ConstructSomeFoo()
If foo is None (or otherwise falsy) then it gets this default value or the newly constructed object; otherwise it's unchanged. Key here is that what's after "or" is not even evaluated if the first operand is already truthy.
So, the left "Tarzan" eats up the matching substring that we do not want, while the right (Tarzan) matches what we do want, but only if the left one didn't already hit.
It doesn't rely on order of operations, at least in this example?
"Tarzan" will match one character earlier than Tarzan as the sting is scanned, so it would be discarded even if you flipped the order of the alternation.
This isn't true of examples where the good and bad matches can start at the same character.
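A quick Python check of that claim (sketch):

import re
text = 'He said "Tarzan" to Tarzan'
for pat in (r'"Tarzan"|(Tarzan)', r'(Tarzan)|"Tarzan"'):
    print([m.group(1) for m in re.finditer(pat, text) if m.group(1)])
# both print ['Tarzan']: the quoted match starts one character earlier either way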
I consider myself a decent regex-er, but, despite several attempts over the years (admittedly all in moments when I had an urgent problem to solve), I still don't get lookbehinds / lookaheads, and end up finding some way to do without them.
Nice to see some examples of how ugly lookbehinds / lookaheads can be. And nice to have a new trick for avoiding them!
Although personally, I still think the most pragmatic solution in this case is usually to just filter out "Tarzan" values somewhere other than in regex.
This one bugs me, because it's a cool enough trick, and I want to expand my thought process when it comes to regex.
The closest I can get visually in vim:
/"tarzan"\zs\|tarzan
(The \zs flag starts the cursor and highlighting at a given location inside of a larger regex match. I didn't use a capture group here because it didn't help.)
Two problems:
1. This will still match the quoted word when pressing "n" but mostly unhighlights it. (see next point)
2. Whatever single character is after the unwanted match is highlighted, so this would only help for visually searching for reasonably long expressions.
-----
An alternative that I would use unless a special edge case was present (and this is basically the dumb version of the author's typical solutions):
/tarzan\ze[^"]
(\ze ends the match but continues to filter whatever follows)
In a persistent edge case, I'd probably resort to macros or temporary replacement of the unwanted term. But that's not very satisfying, is it?
-----
More details:
Capture group references evidently work in vim's search mode. I hadn't tried until now. I only see utility in a few cases e.g. finding any duplicate word. The specific case given at the link does not work as-is. I'd need a way of evaluating the author's full expression and then only match the capture group. Is there a way to put the capture group \1 outside of the alternation?
There's possibly a way to use back-referencing or global search and execution or branches. The solution is also probably very clever and concise! I've tried a few permutations and am still stumped.
-----
Last best attempt:
/"\@<!tarzan"\@<!
(\@<! will match if the previous atom---in this case, double quotes---is not present.)
An edge case where this falls apart? Single leading double quote e.g. "tarzan
/
\s* # Maybe whitespace at the beginning
[\w.]+ # Header key
= # Equals (yes:)
(?: # Header value from here
[\w.]+ # Just about anything
| # or
" # it could be wrapped in quotes
(?:
[^"\\] # Not quotes or backslashes
|
\\. # Sometimes escaped characters occur
)* # Could even be an empty string
"
)? # Maybe they didn't supply a value
\s*
/x
If you can use interpolation with your regexes you can extend this idea into (largely) self-documenting regexes.
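A sketch of that idea in Python with re.VERBOSE and f-string interpolation (the header format and names are made up for illustration):

import re

KEY    = r'[\w.]+'                    # header key
QUOTED = r'"(?:[^"\\]|\\.)*"'         # quoted value, escapes allowed
HEADER = re.compile(rf'''
    \s* ({KEY}) \s* =                 # key, then equals
    \s* ({QUOTED} | {KEY})?           # optional value, quoted or bare
''', re.VERBOSE)

print(HEADER.match(' name = "hello world"').groups())
# -> ('name', '"hello world"')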
So, I realize that things get more complex when you start extending the length of your "context" (though I will argue that in a lot of these cases the result is wrong anyway, so attempts to make it less wrong are weird: you can't, for example, match XML with a regex, so if you are doing that at all I'm expecting you are at a command line trying to do some quick grep and sed filtering, in which case there are some really really easy solutions at hand that are going to be fine), but... this article starts with the premise that the default solution to this is somehow lookaround, but lookbehind in particular is a feature you aren't given often enough that it is worth avoiding it, and it is easy to do so in this case: '((?!"Tarzan").|^)Tarzan'. Though like, "real talk": does whatever random tool you are using even have lookahead?

The reality is that lookaround is convenient, and I totally got lost in its intricacies a long time ago, but this is such a simple case that it seems worthwhile to appreciate how it works and how the underlying expressions manifest... and then like, I can appreciate that the "trick" the author is advocating for is easily extensible to multiple "contexts" (which I keep putting in quotes, as if we are honest about this being "context" then you can't solve it with a true "regular expression"... we are kind of half-assing this by not realizing that the middle Tarzan in '"Tarzan"Tarzan"Tarzan"' is not actually enclosed in quotes), but it is even less useful than the lookaround variants (which are at least supported by grep -P)... how am I supposed to pass the article's recommendation to grep, much less grep -l?

If we are somehow required to solve this problem with a single regular expression (which is totally an acceptable limitation, as that's what makes using grep -l complex: doing multi-stage filtering is annoying), my recommendation thereby would be to simply do: '([^"]|^)Tarzan|"Tarzan([^"]|$)'. This doesn't require any fancy features, and I think is thereby much easier to explain than anything you can find in this article (including the author's "trick").

If you don't have to do it in a single pass, then do it in two: grep 'Tarzan' | grep -v '"Tarzan"' (which is sloppy, but again: you can't actually do this task using regular expressions anyway, so sloppy is fine: look at the result and verify it makes sense... under no circumstances should you code stuff like this for automated use in production, though, which is then scary as the author's "trick" is really only applicable to sitting around in a heavier language, meaning they might not understand that this is all flawed by definition).
I've seen several regexes in various code reviews that are used to validate user input but do so in an exponential manner that can be exploited for simple DoS attacks.
Ooooh or worse, I once caught someone's "email matching" RegEx code during a code review that was opening the door for some nasty SQL Injection or XSS attacks (kind of like validating if the text field contained a valid email.. but not if it was ONLY a valid email).
The problem with RegEx is its "obscurity". However, maybe someone could write a nice testing tool that would throw millions of known exploits at each regex it finds in your code to see if it is vulnerable.
It's more a question of which ones can't be. There are some really nasty and not very obvious gotchas here; https://regular-expressions.mobi/catastrophic.html has a good dive into how, for example, backtracking combines with incautious regex design to produce exponential behavior in the length of input.
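The textbook example, as a hedged Python sketch (don't run it with many more 'a's; each one roughly doubles the time):

import re
evil = re.compile(r'(a+)+$')        # nested quantifiers: exponential backtracking on failure
print(evil.match('a' * 22 + 'b'))   # -> None, but only after ~2^22 backtracking steps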
I don't have a hard and fast rule of my own about regex complexity, but I do have a strong intuition over what's now ca. 25 years of working with regexes dating back to initial exposure in Perl 5 as a high schooler. That intuition boils down more or less to the idea that, when a regex grows too complex to comprehend at a glance, it's time to start thinking hard about replacing it with a proper parser, especially if it's operating over (as yet) imperfectly sanitized user input.
Sure, it's maybe a little more work up front, at least until you get good at writing small fast parsers - which doesn't take long, in my experience at least; formal training might make it easier still, but I've rarely felt the lack. In exchange for that small investment, you gain reliability and maintainability benefits throughout the lifetime of the code. Much of that comes from the simple source of no longer having to re-comprehend the hairball of punctuation that is any complex regex, before being able to modify it at all - something at which I was actually really good, as recently as a decade or so ago. The expertise has since expired through disuse, and that's given me no cause for regret; the thing about being a regex expert is that it's a really good skill for writing unreadable and subtly dangerous code, and not a skill good for much of anything else. Unreadable and subtly dangerous code was fine when I was a kid doing my own solo projects for fun, where the worst that'd happen is I might have to hit ^C. As an engineer on a team of engineers building software for production, it's not even something I would want to be good at doing.
> That intuition boils down more or less to the idea that, when a regex grows too complex to comprehend at a glance, it's time to start thinking hard about replacing it with a proper parser
You can get some surprisingly complex yet readable regexes in Perl by using qr//x[1] and decomposing the pieces into smaller qr//s that are then interpolated into the final pattern, along with proper inline comments in the regexes themselves.
I don't see anything about qr//x that makes regexes built this way less vulnerable to the kind of exponential backtracking problem under discussion here.
I do see a great opportunity to, by assuming interpolated qr// substrings have the locality the syntax falsely suggests, inadvertently create exactly that kind of mishap with it being minimally no easier, and potentially actually more difficult, to notice.
Write your code however you like, of course, including concatenating strings and passing the result to 'eval'. The last time I dealt with more Perl than a shell one-liner was around 2012, and that the language encourages this kind of thing is one of the reasons I'm glad of that.
Given that I write my code with a text editor that does nothing but concatenate strings that I input and then I pass it to a compiler or an interpreter, all of the code I write is concatenating strings and passing it to 'eval'.
And I use proper decomposition to keep it cognitively manageable. It's pretty clear that reasoning about composition is beyond you, but trust me that given two procedures that both do not have an undesirable property, one can rest assured that simple composition will not introduce that undesirable property.
Many things are beyond me. Perhaps it's to my good fortune that the generally low utility of gratuitous personal insults is not among them. Certainly the next technical discussion I see improved by such behavior will be the first.
Well then, in the interest of amity let me suggest that it would be to your good fortune to work on your self-awareness. But, should you prefer not to, then by all means, you do you.
For me, the site rendered dark gray text on a dark gray background and is a chore to read as-is. Outline.com fixed my issue with it: https://outline.com/YSYgsp
"Please don't complain about website formatting, back-button breakage, and similar annoyances. They're too common to be interesting. Exception: when the author is present. Then friendly feedback might be helpful."
(It's not that the annoyances aren't annoying, it's that they're so common that they lead to repetitive offtopicness that compounds into more boring threads.)
I got curious and looked back in archive.org to this page's initial release in 2014. The text background started out as good old reliable background-color: #EEEEEE, which was later replaced with background: url("http://a.yu8.us/bg-tile-parch.gif")
...because what could possibly go wrong? From the latest comment at the end of the page, the author would like you to know that the outcome is your problem, because you're using the wrong browser:
June 20, 2021 - 15:02
Subject: RE: Undoing whatever is hiding this page.
Hi Allen, try a different browser. There's no strange shading on the page, your browser is deciding to display it in a weird way. Regards, -Rex
Most likely using the HTTPS Everywhere addon. That website is not available via HTTPS, and the user must visit the page first to accept the 'risk' of using the http version.
Firefox also defaults to HTTPS nowadays. Lots of content blockers block third-party content too. Regardless, if literally anything goes wrong with the third-party dependency that the article's contrast depends on, the best case scenario here is that the text falls back on the body's background.
Interestingly, the author also appears to control yu8.us
Breaking one's own content by https-ing one site but not another is a great example of why to not prop up a website's basic legibility on a third party dependency, even if it's one you own and control.
Yes, the web author made the mistake of defining the <article> background-color: #EEEEEE within a min-width 960px media query. If the background image fails to load in a wider window, there's still readable contrast between text and background, but on a phone or other narrow screen, the dark background color set on the <body> is what's behind the article text.
This is why any attempts to make plain http sites throw up scare warnings is a horrible idea. The internet is littered with old websites that contain a wealth of knowledge and deserve to remain accessible.
Just make browsers go into "read only" mode where input cannot be accepted on non-secure pages. But don't wall them out!
Regexp for tokenization does work. This entire essay boils down to the fact that you can always postprocess matches and in this case that corresponds to tossing unwanted tokens out.
This site reminded me of the times when I interviewed candidates. One of the interview problems was to write a function that would validate whether a given string was a valid IPv4 address (a la 10.10.10.1).
Some of the candidates started by saying: "I know! I'll use a Regular Expression", to which I replied: "Great! Now you have TWO problems!"
This is simply not absolutely correct, only conditionally so.
The only way in which
"Tarzan"|(Tarzan)
extracts a Tarzan that is not in quotes is when it is used for scanning the input for non-overlapping matches.
(We know from lexical analysis with regexes, a form of non-overlapping extraction, that the "Tarzan" token is different from a Tarzan token. An identifier won't be recognized if it is in the middle of a string literal.)
It's not the regex itself, but a particular way of using it.
If the regex is used for finding all maximally long matching substrings, then it won't work. It will find "Tarzan" and it will find the Tarzan also within those quotes.
Notably, the regex will also fail if it is used to find a single match, like the leftmost. If the datum is a string like
"Tarzan", said Jane; Tarzan turned.
then the leftmost "Tarzan" will be found, and that's it. The regex will not find the leftmost Tarzan that is not wrapped in quotes.
We cannot even use this to simply grep files for lines that have Tarzan that is not in quotes.
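In Python terms, using the comment's own example (a sketch):

import re
text = '"Tarzan", said Jane; Tarzan turned.'
pat = re.compile(r'"Tarzan"|(Tarzan)')
print(pat.search(text).group(1))      # None: the single leftmost match is the quoted one
print([m.group(1) for m in pat.finditer(text) if m.group(1)])   # ['Tarzan']: a full scan still works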
I have a process table and I want to grep it for the phrase "banana":
ps auxww | grep banana
root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana
mikec 456 450PM 0:00.00 grep banana
Argh! It also greps for the grep for banana! Annoying!
Well, I'm sure there's pgrep or some clever thing, but my coworker showed me this and it took me a few minutes to realize how it works:
ps auxww | grep [b]anana
root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana
Doc Brown spoke to me: "You're just not thinking fourth dimensionally!" Like Marty, I have a real problem with that. But don't you see: [b]anana matches banana but it doesn't match 'grep [b]anana' as a raw string. And so I get only the process I wanted!