Some useful regular expressions for programmers (lemire.me)
161 points by zdw on April 23, 2021 | 72 comments



Four of those were to catch formatting errors. IMHO it's best to just use an automatic code formatter. clang-format is extremely good.

When I write code for personal projects I don't even bother trying to format it correctly. The only whitespace button I use is the spacebar. tab is mapped to the function that reformats the entire document according to the formatting rules I've configured.

At work, whenever I apply a change to a function I autoformat that function. I can't just format the entire file, even though formatting errors are rife in the codebase I work on.

goto fail; is real.


Yeah, an automatic formatter and linter/language server handles the first 5 very smoothly. The 6th one (surrounding terms with word boundaries) is useful in many situations, but a lot of editors also support doing this natively; e.g. "Match Whole Word" in the VS Code search input fields.

I don't want to be too cynical about the article, since these are absolutely all good things to know due to their myriad uses in other circumstances. But if your intention is to apply these to code you're writing, as the article describes, I don't think any of them are practical if you're already using an IDE/IDE-lite. (It probably makes a lot more sense if the author is using a lighter editor like vim without a lot of plugins, though.)



Second this. I'm pretty happy writing code in a plain old text editor much of the time, without most IDE features.

One thing I just can't live without, however, is a code formatter. Formatting code by hand is tedious and repetitive and I have no patience for it.


Definitely. Install a code formatter, make it part of your linting process so that it fails your CI, run it on save in your editor, never have boring conversations about formatting ever again with your team.


Binding auto format to the tab key has blown my mind!


Alternatively, do it on save. :)

gg=G in vim is fast enough to type for me, though.


We made it part of our CI to test whether all files are formatted with the formatter, and I bound the formatter to the save operation.

So I can just write, save the file, and everything is nice :)


> tab is mapped to the function that reformats the entire document according to the formatting rules I've configured.

Huh. You know...


My favorite regex is /[ -~]*/, space-dash-tilde. That represents approximately A-Za-z0-9 plus punctuation and space. It's useful for something like `tr -d` to remove all common characters and leave all of the "oddities" so you can differentiate "" from “” or ... from … in things like markdown or source code.
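
The same idea in Python, for illustration (re.sub instead of tr):

    import re
    text = 'smart “quotes” and an ellipsis … in here'
    # Delete every run of printable ASCII; only the oddities survive.
    print(re.sub(r'[ -~]+', '', text))  # prints: “”…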


Explanation: [ -~] is matching the range of characters† from space (32) to tilde (126), which is the full range of printable ASCII characters. (0–31 are various control characters, and 127 is one last control character, ␡.)

† “Characters” is a deliberately vague term here. Different regular expression engines match different types of things, typically supporting one or two modes. The three most common things to match are bytes (the traditional default, normally encoding-unaware but potentially known to be ASCII or UTF-8), UTF-16 code units (JavaScript’s default, and UTF-16 was an all-round terrible idea that ruined Unicode for everyone else while achieving nothing useful), and Unicode scalar values (which is typically called “Unicode mode” and often enabled by a /u flag).
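
Python's re shows two of these modes side by side (a quick illustration):

    import re
    # str patterns match Unicode scalar values; bytes patterns match raw bytes.
    print(re.findall(r'[^\x00-\x7F]', 'café'))            # ['é'] (one scalar value)
    print(re.findall(rb'[^\x00-\x7F]', 'café'.encode()))  # [b'\xc3', b'\xa9'] (two UTF-8 bytes)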


Add a tab character in there (before the space) and you're even better!


Oh! That's nice to know.


> It is often best to avoid non-ASCII characters in source code. Indeed, in some cases, there is no standard way to tell the compiler about your character encoding, so non-ASCII might trigger problems. To check all non-ASCII characters, you may do [^\x00-\x7F].

Depends on the language. In Python 3, files are expected to be utf8 by default, and you can change that by adding a "# coding: <charset>" header.
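
For illustration, a file using such a header (PEP 263 syntax; the variable is just an example, and the file must really be saved in latin-1):

    # coding: latin-1
    mois = "février"  # decodes correctly because the header matches the file's encoding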

In fact, it's one of the reasons Python 3 was a breaking release in the first place, and being able to put non-ASCII characters in strings and comments in my source code is a huge plus.

I mention this because in another article the author said he's used to coding in "C++, C, Go, Java, JavaScript, Python, R, Swift, Rust, C#", so I find it kinda weird.


> It is often best to avoid non-ASCII characters in source code.

Scientists writing Julia are laughing on the floor.


Sometimes I wish COBOL style hadn't won over mathematical style as the convention for every general purpose language except possibly Go.


There are many languages pushing for mathematical style formulas, but until they get easy to type, I don't think any of them will succeed.

It's not COBOL that won. It's qwerty.


Julia makes typing unicode characters very easy; so easy that I keep a Julia repl open a lot of the time just for typing various mathematical symbols easily.

For example, to type α², you just type \alpha, hit the tab button to turn it into α, and then type \^2 and hit tab again to turn that into ².

If someone gives you a unicode character that you don't know how to type, you just hit the ? button to enter the repl-help mode and then paste the character in.

    help?> α²
    "α²" can be typed by \alpha<tab>\^2<tab>

    search:

    Couldn't find α²

      No documentation found.

      Binding α² does not exist.


The amazing thing about this UI is that it just works across many different editors and REPLs (VSCode, Emacs, Pluto, etc.).

It reminds me of GNU TeXMacs, which has a similar interface for equations. That is, typing "\int" followed by tab will render an integral sign.


> It is often best to avoid non-ASCII characters in source code.

>> Depends on the language. In Python 3, files are expected to be utf8 by default, and you can change that by adding a "# coding: <charset>" header.

It's interesting that many languages avoid unicode and non-ASCII text, yet they make assumptions about the file and directory structure of the underlying system. It's as if interpreting directory and file system structures is "okay", but interpreting file formats is not.

> In fact, it's one of the reasons it was a breaking release in the first place, and being able to put non-ASCII characters in strings and comments in my source code are a huge plus.

Sorry, but as a Python dev that went from 2 to 3, yes native unicode features are nice, but no, it was not worth breaking two decades of existing code.


As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

Up to 2015, my laptop was plagued with Python software (and others) crashing because they didn't handle unicode properly. Reading from my "Vidéos" directory, entering my first name, dealing with the "février" month in date parsing...

For the USA, things just work most of the time. For us, it's a nightmare. It's even worse because most software is produced by English speaking people that have your attitude, and are completely oblivious about the crashing bugs they introduce on a daily basis.

In Asia it's even more terrible.

And I've heard people say you can perfectly well write unicode-aware software in Python 2. Yes, you can, just like you can write memory-safe code in C.

In fact, just teaching Python 2 is painful in Europe. A student writes their name in a comment? It crashes if you forget the encoding header. Using raw_input() with unicode to ask a question? Crashes. Reading bytes from an image file? You get back a string object. Got garbage bytes in a string object? It silently concatenates with a valid string and produces garbage.


> As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

I live in Europe and I (mostly) agree that (most) code shouldn't (usually) contain any codepoint greater than 127. It's a simple matter of reducing the surface area of possible problems. Code needs to work across machines and platforms, and it's basically guaranteed that someone, somewhere is going to screw up the encoding. I know it shouldn't happen, but it will happen, and ASCII mostly works around that problem.

Another issue is readability. I know ASCII by heart, but I can't memorize Unicode. If identifiers were allowed to contain random characters, it would make my job harder for basically no reason.

Furthermore, the entire ASCII charset (or at least the subset of ASCII that's relevant for source code) can easily be typed from a keyboard. Almost everyone I know uses the US layout for programming because European ones frankly just suck, and that means typing accents, diacritics, emoji or what have you is harder, for no real discernible reason.

String handling is a different issue, and I agree that native functions for a modern language should be able to handle utf8 without crashing. The above only applies to the literal content of a source file.


I understand not wanting utf8 for identifiers, since you type them often and want the lowest common denominator of keyboard layouts, but for comments and hardcoded strings, certainly not.

Otherwise, Chinese coders can't put their names in comments? Or do they need to invent ASCII names for themselves, like we forced them to do in the past for immigration?

And no hardcoded value for any string in scripts then? Names, cities, user messages and labels, all in config files even for the smallest quick-and-dirty script, because you can't type "Joyeux Noël" in a string in your file? I can't hardcode my "Téléchargements" folder when doing some batch work in my shell?

Do I have to write all my monetary symbols with escape sequences? Keep a table of constants filled with EURO_SIGN=b'\xe2\x82\xac' instead of using "€"? Same for emoji?

I disagree, and I'm glad that Python 3 not only allows this, but does it in a way that is actually stable and safe. I rarely have encoding issues nowadays, and when I do, it's always because something outside is doing something fishy.

Utf8 in my code file, yes please.


Lol, just realized my table should be "EURO_SIGN=b'\xe2\x82\xac'.decode('utf8')".

Damn, a perfect example of why it's nice to be able to just use "€" in my file. It also means fewer bugs.


If we're still talking about Python 3, you can also write,

  EURO_SIGN = '\N{EURO SIGN}'
Which is ASCII, and very clear about which character the escape is. I wish more languages had \N. That said, I'm also fine with a coder choosing to write a literal "€" — that's just as clear, to me.

While I don't believe the \N is backwards compatible to Python 2, explicitly declaring the file encoding is, and you can have a UTF-8 encoded source file w/ Python 2 in that way.

I'd also note the box drawing characters are extremely useful for diagrams in comments.

(And… it's 2021. If someone's tooling can't handle UTF-8, it's time for them to get with the program, not for us to waste time catering to them. There's enough real work to be done as it is…)


You can do Unicode in Python 2. You can do Unicode faster and easier in Python 3. But by gaining that ability, they set the existing community back by 10 years.


I wouldn't say that everything in \x00 to \x1f is OK in source code.


Eh, I reckon they’re fine.

In shell scripts especially (and sometimes elsewhere), if I want an ANSI escape code, I’ll type the actual bytes I want, because it’s more reliable and easier to type <C-V><Esc> and see ^[ in Vim than it is to worry about whether I’m using \u001b, \u{1b}, \x1b, \033 or something else, and whether I have '', "", $'', $"", printf, echo -e, &c. &c. Speaking generally, if you use the actual control characters, the code will either fail to run (parse or compile error) or work, whereas encoding your control codes with escape sequences is more hit and miss. I’ve seen quite a few others using control characters in this way too.

Truth to tell, I’m quite liberal in using all of Unicode in my source code. About the only places I wouldn’t use the actual characters are when (a) the environment requires me to escape the thing; (b) the numeric values of the characters are more meaningful than the characters themselves; and (c) when dealing with combining characters and modifiers, because otherwise they’ll do things like combine with the opening ' from the string/character literal.


I mean, in non-strawman examples of programming languages.

The code will work, but how will it edit outside of your personal environment? Or print?

If you quote it in a PDF paper, will it be copy-pastable?

Basically I would worry about all that sort of stuff.

If I have echo '^L' to clear the screen, and that program is sent to a line printer, will I get:

  echo '
  [ paper ejected ]
  '
If you so much as "cat" your code to a terminal, all the codes get interpreted.


In Emacs Lisp it's common to include a real ^L specifically to trigger page breaks, a habit from when printing code was more common (and it has page-aware navigation commands accordingly).


My favorite regex is the E-Mail Regex[1].

Works 99.99% of the time, is unreadable/unmaintainable, makes you understand that the only true way to verify an e-mail is to send an e-mail to that address.

[1] - https://emailregex.com/

PS: Gotta love the Perl / Ruby version


The only way you should ever "validate" an email address with a regex is like this: /@/


I suppose it depends on what we mean by validate. Running an ecommerce site, I got a lot of mileage out of prompting the customer to fix emails that "looked wrong". We allowed them to proceed if they wanted. A really common one was "user@gnail.com" when "user@gmail.com" was wanted. We used a slightly modified version of https://github.com/mailcheck/mailcheck and found it to be really useful.


Surely that should be /.+@.+/


The one I use for anything that might take user input from a browser is the one defined in the HTML5 spec for input[type=email]:

    /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
There’s no sense being less permissive, if it’s good enough for browsers it’s the baseline expected by browser users. But there’s no sense being more permissive for the same reason.
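
If you need the same check server-side, it drops straight into Python (a sketch; the `\/` escape from the JS literal isn't needed here):

    import re
    HTML5_EMAIL = re.compile(
        r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]"
        r"(?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
        r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$")
    print(bool(HTML5_EMAIL.match("user@example.com")))   # True
    print(bool(HTML5_EMAIL.match("user@-example.com")))  # False: a label can't start with a hyphen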


Yes, there is. HTML does not define what email addresses look like. If input[type=email] rejects valid addresses, it's harmful garbage.


Huh? My point is that if you expect user input from a browser’s input[type=email], you have little choice but to accept that it will reject emails not matching that pattern. Harmful garbage or not, a more permissive pattern won’t mitigate that.


Your regex would validate:

  this-is-not-a-valid-address@
  @this-is-not-valid-either
  @


But allowing too much is better than allowing too little, as usually you have to send an actual email to verify ownership anyway. Any regex more complex than /.+@.+/ fails some valid email address.


Which was the point of my first message.


Surely that should be

/^[^@]+@[^@]+$/

?


According to the RFC compliant email regex,

    "\@"@example.com
is a valid email address, which your simplified test would reject.
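
A quick check in Python bears this out:

    import re
    addr = r'"\@"@example.com'
    print(bool(re.fullmatch(r'[^@]+@[^@]+', addr)))  # False: the quoted @ trips it up
    print(bool(re.fullmatch(r'.+@.+', addr)))        # True: the greedy .+ absorbs the first @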


Exactly.

The worst regex would be:

  /^[a-zA-Z0-9\-_]+@[a-zA-Z0-9\-_]+\.[a-zA-Z0-9\-_]+$/
Because it would invalidate `my-email+custom-inbox@example.com`. And that's a pattern I use to automatically sort incoming mail.
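
For example, in Python:

    import re
    bad = re.compile(r'^[a-zA-Z0-9\-_]+@[a-zA-Z0-9\-_]+\.[a-zA-Z0-9\-_]+$')
    print(bool(bad.match('my-email+custom-inbox@example.com')))  # False: the + is rejected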

Many websites use such a regex :(


The correct way would be to implement a parser off the ABNF defined in whatever the current RFC is for email addresses.


What exactly is the current RFC for email addresses? Do you go with 2822/2823 or do you read all the extension ones?


And yet, the successfully parsed e-mail can still be nonexistent and therefore invalid :)


Or, more insidiously, the address can exist, its recipient can receive mail there, your form can validate it, but your SMTP server can't handle the characters needed to send any mail in the first place.


Nice examples. Feedback for the author if they see this thread:

1) \S can be used instead of [^\s]. I haven't come across a regex flavor that supports \s but not \S. GNU grep supports \s and \S but doesn't support using them inside character classes.

2) For the first example, there is a weird mix of formatting in the code snippets. The rest of the examples are properly code formatted and specified without single quotes around the expression.

3) The fourth example uses variable length lookbehind, which isn't supported by Perl, Python (unless you use the regex module), Ruby, etc. \K can be used instead if supported. Also, `<` at the start of the regex is a typo.


> variable length lookbehind, which isn't supported by Perl

It is supported: https://perldoc.perl.org/perl5300delta#Core-Enhancements

Work-around for all previous versions: http://www.drregex.com/2019/02/variable-length-lookbehinds-a...


Good to know that support is being worked on. It seems limited as of now; this is what I get on v5.32:

    $ echo 'asb34 asgfdhgree56 jdf42' | perl -lne 'print join "\n", /(?<=\ba[a-z]*)\d+/g'
    Lookbehind longer than 255 not implemented in regex m/(?<=\ba[a-z]*)\d+/ at -e line 1.
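
For comparison, Python's third-party regex module (mentioned upthread) does accept the variable-length lookbehind; a quick sketch:

    import regex  # pip install regex
    print(regex.findall(r'(?<=\ba[a-z]*)\d+', 'asb34 asgfdhgree56 jdf42'))
    # ['34', '56']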


Coming from a Perl background I do love regex but this is just the wrong use for it.

1. Those expressions fall down on a number of edge cases (eg heredocs / string literals)

2. It is actually perfectly reasonable to have non-ASCII inside source code, because these days it's increasingly common for compilers to assume source is UTF-8 and, frankly, English isn't the first language for most of the world, so it's rude of us to assume all comments and quoted strings should be in English.

3. There are already multiple code formatters for nearly every language out there, and they take tokenised versions of your source (i.e. they will take into account stuff like the string literals I mentioned earlier). You're far better off picking whatever code formatter is considered the (de facto) standard for your language and using that. They come with the additional benefit that they don't just highlight formatting faux pas but can also enforce style guidelines.


At my former job, the dev team wasn't allowed to use a formatter, so people would often commit code that would be off by a space or three. Similar to the post, I got into the habit of detecting these with this:

  ^( {4})\* {1,3}\S
There were cases where the backend would generate a stack trace, but HTML-format it in a response where HTML wasn't expected. I could select and delete the tags with this, which would make the results instantly more readable:

  <[^>]+>
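
The same tag-stripping idea in Python, for illustration (good enough for this kind of cleanup, though not a general HTML parser):

    import re
    response = '<p>Exception in thread <b>&quot;main&quot;</b></p>'
    print(re.sub(r'<[^>]+>', '', response))  # Exception in thread &quot;main&quot;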


Wasn't allowed to??? A giant hand smote your head if you typed the code format command?


Metaphorically, yes. I mostly built off of existing code while I was there. Frequently (and not just for myself), I would see code getting held up in review because the more tenured developers who looked at it would say that the formatting was beyond the scope of the changes being made and would refuse to approve anything until all of the whitespace was reverted. This same class of devs would also commit new changes with more cases of these annoyances, such as trailing whitespace and mixed tabs/spaces, but were able to merge their unreviewed stuff in without management batting an eye.


This just gave me an idea. I’ve been on teams like this. I’ve also been on teams at the other extreme, where formatting nit picking took up the vast majority of review.

The former is obviously painful because poorly formatted code is a constant mental burden. But it’s also true that unrelated formatting changes can present a similar mental burden and make meaningful changes harder to identify.

The latter can detract from more meaningful review. But prioritizing better formatting consistently keeps a codebase more manageable over time.

This is somewhere tooling can help, more than it already does. Automated linting/formatting is great. Depending when it’s applied, and how effective its automation, it can potentially defer the burden til review time. But there’s still a lot of room for improvement there.

Many diff tools present an “ignore whitespace” option. I would like to see this taken two steps further:

1. Better language syntax awareness could go beyond whitespace as formatting-specific “noise”. I know there’s work in this direction and it’s great too. Besides making it more available and complete...

2. “Ignore” is probably not the best option. Ideally tooling can also identify these two sets of changes so they can both be viewed and, more importantly, merged separately.

The problem right now is that formatting is noise pretty much everywhere one might interact with a project. Making it a distinct set of changes at the diff and history level would help significantly.


That would certainly help to curb the problem. On the flip side of that, a couple of things that have helped to stop it before it gets to that level are to ask the developers to enable "auto-trim whitespace" and to enable the option to show whitespace and/or hidden characters. Most editors can do these things. They go a long way toward stomping out the problem, even if auto-formatters aren't being used.

Highlighting trailing spaces is an option in git diffs too, though the devs who submit them typically don't look at their own diffs after they've pushed them (at least in my experience).


I generally dislike getting commits to review with a bunch of unrelated formatting changes around it, so I kind of get that pushback. As a lead I've asked people to format PRs so at least they're separate commits to review. That said, even I set my editors to clip EOL whitespace and add a linefeed on save if need be for every file I touch, so I guess I can't be too rigid about it.

I do think formatters work great (much better than format linters), but only if everyone runs them on landing from a shared conf. Otherwise you end up playing format tennis, with a lot of expression-normalization noise if it's working from the AST.

I think the point where it's rewriting braces, adding parens, etc, is the point where I'd typically want to see the reformat split out for review. I haven't seen a code review tool in a long time that didn't let you hide whitespace-only changes, though, so not sure I think pushback would be justified there if the files were being touched anyway.


It's not uncommon for configurations to be included in a codebase. Anyone new to a project who uses an alternate editor can commit a config with the same settings, too.

One thing we're running into at my current company is that there are pre-commit auto-formatters that are employed that will take care of this for us. It's really convenient. The only drawback is that this process was introduced many years after the codebase was created, so there are still occasional lingering files that will pull in formatting changes once someone changes even a single character. The upside is that this isn't very frequent at this point, and we're kind of expecting them to happen on occasion.

A couple of devs have also been planning on running the entire codebase through the formatter all in one go, which will happen during a lull. I've seen another company do this once, and while the result is nice, it always ends up with someone getting frustrated with having to iron out conflicts in some branch they've been working on. (And of course, another drawback is that it complicates the blame process, but that might not be a huge problem depending on the codebase.)


For those who, like me, can't stand the cognitive overhead of regex, verbal regex is a major improvement: https://github.com/VerbalExpressions/JavaVerbalExpressions


I don’t see how that’s less cognitive overhead. It uses words, but not only are some of the choices individually dubious for what they do (using domain specific symbols isn’t higher cognitive load than using common words with no obvious domain specific meanings), it breaks a cardinal rule of naming in an API by using different parts of speech to name things of the same kind, mixing verbs (“find”) and nouns (“anything”).

It’s way more verbose, but not, in any way I can tell, lower cognitive load.


Even as structured regular expression languages go it's quite a poor example since it doesn't capture any of the, uh, structure, of the regular expression. In the end it still has all the downsides of the linear string syntax (and none of the upsides - it's not composable at all either, for example), isn't really clearer, and it'll be immensely slower.


I can't imagine finding

    VerbalExpression regex = regex()
                    .find("a")
                    .capture().find("b").anything().endCapture().then("cd").build();
To involve less "cognitive overhead" than

    /a(b.*)cd/
For example, what is "anything"? (In this case, "anything" can be nothing.) Where does the capture group start/end? (In normally regexp syntax even naive brace matching will usually tell you.)


A personal common favorite:

    "(?:[^\\"]|\\.)*"


Disclaimer: none of what follows is criticism of you, of your contribution, or even of regex. I’m actually tickled by such a terse comment giving me so much to think through.

- - -

This is my fourth attempt at understanding the intent of this pattern. I’m quite comfortable with regex, so much so I listed it on my resume skills list for years as sort of a glib joke. I think I’ve been overthinking this one. For my benefit I’m gonna describe what I think this matches and ask you to confirm. For others I hope the description helps.

This appears to be intended to match a balanced double quote, where any characters between may be anything besides a backslash or double quote, or any character preceded by a backslash. The content between the bounding quotes isn’t captured, which is to say parentheses usually designate a capturing binding like $1 or \1 depending on language/context, but the ?: tells the regex engine you don’t need/want that binding.

If I’m understanding all of that correctly, it’s probably very close to what you intended but definitely highlights two of the biggest problems with regex:

1. Intent is easily lost on read. This is trite to say, but I’m pointing it out as someone who’s played regex golf for fun and still spent over an hour getting confident enough to describe what this short pattern appears to do.

2. Balanced pairs allowing an escaped terminator are not “regular language”. It’s possible (and I think probable but I’m not at a device capable to confirm) that this pattern is safe because the balanced pairs are single-character and there’s only one possible escape. But the fact that I can’t say so with confidence is something that would make me cautious using it.

All of that said, if I were writing this, I’d try to make it both more clear in intent and a little more predictably resilient.

1. I’ll include slash delimiters to remove ambiguity.

    /"(?:[^\\"]|\\.)*"/
2. I’ll express the escape exception first:

    /"(?:\\.|[^\\"])*"/
3. I’ll remove the need to check unescaped characters by terminating the match on the first unescaped quote (using ? after a wildcard is non-greedy and stops at the first match; the default is to match to the last):

    /"(?:\\.|.)*?"/
I think this makes me feel a little more confident with the escaping, but I’m still iffy without testing it. But it does make for a pretty cool emoticon!

Final thoughts:

- In almost every context where you could use this regex, `.` will not match any kind of line break, so strings spanning lines won’t match. That might be desirable but depends on what you’re trying to do.

- Even with the abundance of caution I expressed, this is decidedly unsafe for processing user input. For example I’m certain this would be an XSS vulnerability.


> 1. Intent is easily lost on read.

In actual code, it is of course attached to some definition like `type: TOK_STRING`.

> 2. Balanced pairs allowing an escaped terminator are not “regular language”

Are you confusing this with the fact that, e.g., the language of validly parenthesized expressions is not regular? A single string with the option to escape any character including the escape character and terminator is regular.

    /"(?:\\.|.)*?"/
This makes it impossible to include a " in the string; the original regexp works because it seeks the longest match. At this point you might as well have written

    /".*?"/
or equivalently for a greedy match,

    /"[^"]*"/
> I’m certain this would be an XSS vulnerability.

This feels like a big leap for something I didn't say was even involved in web development. I see ways that a failure to handle an invalid string (and this will match invalid string in languages that have additional escape forms with stricter requirements e.g. `\u1234`) could be chained with other bugs to create an XSS, but it would be difficult to ascribe the XSS to this part. And extension to handle most such things wouldn't make the language non-regular.

(p.s.

> I’ll include slash delimiters to remove ambiguity.

The two languages I have most often used this in don't use slash delimiters, so I think this has only added ambiguity - now we are considering it embedded in some specific-but-unspecified language's regexp syntax where we might need to consider another level of escaping - compared to letting the regex stand "bare".)


> In actual code, it is of course attached to some definition like `type: TOK_STRING`.

Yes, naming helps. I was mostly sympathizing with the people whose brains explode at the sight of a regex. It’s surprising how much time it took me to be sure I understood your simple pattern.

> Are you confusing this with that e.g. the language of validly parenthesized expressions is not regular? A single string with the option to escape any character including the escape character and terminator is regular.

This is a good correction, thank you. I was thinking of cases where balanced pairs allow both escaped delimiters and nested pairs.

> This feels like a big leap for something I didn't say was even involved in web development.

I hope it’s clear I didn’t intend to make that leap, and was just generally reflecting on the ways simple patterns can go awry. That’s why I said “for example” and “would”. If someone were to take this pattern or any of the variants and use it in a context that writes innerHTML, it would be vulnerable to XSS by failing to account for multi-character tokens like &quot; in its escaping.

> The two languages I have most often used this in don't use slash delimiters, so I think this has only added ambiguity - now we are considering it embedded in some specific-but-unspecified language's regexp syntax where we might need to consider another level of escaping - compared to letting the regex stand "bare".)

I understand that many languages don’t use slash delimiters. But it’s quite common to use them for regex where a language context isn’t assumed. My point wasn’t to impose slashes on your example but to make it explicitly knowable, in words, whether the quotes were intended as part of the pattern. Not knowing that was the primary reason I wasn’t sure I understood the pattern’s intent.


This is slightly wrong, depending on whether your regexp engine requires non-greedy matches to find the shortest possible match or not. (I rarely use non-greedy matches because of this, among other reasons.)

    /"(?:\\.|.)*?"/
May either stop at the first ", even if escaped (the \ will match the . case), or it will continue to the last " even if escaped (for the same reason). If anything, you've introduced a vulnerability by breaking the escape handling.


I’m open to the possibility this will vary based on engine, but I don’t think it should. The pattern specifies:

1. Match a quote.

2. Match the shortest sequence of:

  A) any character preceded by a backslash, or

  B) any thus far unmatched character, until
3. Matching the first unmatched quote.

The 2.A case should match before laziness kicks in. The only case when it shouldn’t is if the backslash has already been consumed by a previous escape.


On the main issue of your regexp being wrong: You can feed your regexp into any engine and see it will happily match

    "\"
which it should not.
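
For example, with Python's re:

    import re
    s = '"\\"'  # three characters: quote, backslash, quote
    print(re.fullmatch(r'"(?:[^\\"]|\\.)*"', s))  # None: the original rejects it
    print(re.fullmatch(r'"(?:\\.|.)*?"', s))      # matches: the lazy rewrite accepts it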

As for the issue of varying between engines: The classic deviation between POSIX and non-POSIX regexp is whether you pick leftmost-longest and tiebreak by alternate order, or leftmost first alternate. (POSIX is the correct answer because there are benefits to commutative alternation.) Non-greedy regexp has a similar issue for whether it should prefer shortest or first alternate. (Proper thinking about regexes doesn't involve temporal relations like "before" or "thus far" or whether something is "consumed" - even in engines that don't implement leftmost-longest this can only confuse.)

Engines that take alternation order into account will also have it match the complete string of:

    "\""
Probably to the confusion of anyone who read that non-greedy operators will choose the shortest match.

tl;dr: Don't use non-greedy operators. Don't think about regular language matching in temporal or procedural terms. Don't try to rewrite regexps without test cases.


Nice! Added a few of these to my Anki deck.


~"Now we have two problems."


The person who originally said 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' is Jamie Zawinski. And it's an amusing quote, but, that doesn't mean it's generally correct. In the original context, this was more about Zawinski's utter hatred for Perl. http://regex.info/blog/2006-09-15/247



