
This is not unconditionally correct; it holds only under a specific condition.

The only way in which

  "Tarzan"|(Tarzan)
extracts a Tarzan that is not in quotes is when it is used for scanning the input for non-overlapping matches.

(We know from lexical analysis with regexes, a form of non-overlapping extraction, that the "Tarzan" token is different from a Tarzan token. An identifier won't be recognized if it is in the middle of a string literal.)
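To make the scanning case concrete, here is a minimal sketch using Python's `re` module, where `finditer` performs exactly this kind of non-overlapping, left-to-right scan. The helper name `unquoted_hits` and the sample sentence are only illustrative assumptions.

```python
import re

# The pattern from the discussion: quoted form first, bare form in group 1.
PATTERN = re.compile(r'"Tarzan"|(Tarzan)')

def unquoted_hits(text):
    """Start offsets of the matches in which group 1 took part."""
    # finditer scans left to right and never revisits consumed characters,
    # so a Tarzan inside an already-matched "Tarzan" is not tried again.
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

sample = '"Tarzan", said Jane; Tarzan turned.'
hits = unquoted_hits(sample)  # only the bare Tarzan after the semicolon
```

The filtering on group 1 is part of the technique: the quoted matches are still returned by the scan, they just leave the group unset.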

It's not the regex itself, but a particular way of using it.

If the regex is used for finding all maximally long matching substrings, then it won't work. It will find "Tarzan", and it will also find the Tarzan inside those quotes.
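One way to approximate an "all matches, overlaps allowed" search in Python is to wrap the pattern in a zero-width lookahead, so that an attempt is made at every offset. This is a sketch of that idea, not of any particular tool; the wrapper pattern and the name `all_matches` are assumptions, and wrapping shifts the original capture to group 2.

```python
import re

# Lookahead consumes nothing, so the scan retries at every offset
# and overlapping matches are all reported.
OVERLAPPING = re.compile(r'(?=("Tarzan"|(Tarzan)))')

def all_matches(text):
    """(offset, matched text, original capture group) for every attempt that succeeds."""
    return [(m.start(), m.group(1), m.group(2)) for m in OVERLAPPING.finditer(text)]

found = all_matches('"Tarzan"')
# The second entry is the Tarzan inside the quotes, with the capture set.
```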

Notably, the regex will also fail if it is used to find a single match, like the leftmost. If the datum is a string like

   "Tarzan", said Jane; Tarzan turned.
then the leftmost "Tarzan" will be found, and that's it. The regex will not find the leftmost Tarzan that is not wrapped in quotes.
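The single-match case is easy to check with `re.search`, which returns the leftmost match only. A small sketch on the sentence above:

```python
import re

PATTERN = re.compile(r'"Tarzan"|(Tarzan)')
datum = '"Tarzan", said Jane; Tarzan turned.'

# search() stops at the leftmost match, which is the quoted one.
m = PATTERN.search(datum)
leftmost = (m.start(), m.group(0), m.group(1))
# Group 1 is None: the unquoted Tarzan later in the string is never reached.
```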

We cannot even use this to simply grep files for lines that have Tarzan that is not in quotes.
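A grep-style filter only asks whether a line matches at all, so the quoted branch produces false positives. A minimal sketch, with `naive_grep` and the sample lines being hypothetical:

```python
import re

PATTERN = re.compile(r'"Tarzan"|(Tarzan)')

def naive_grep(lines):
    """Keep every line on which the regex matches somewhere, as grep would."""
    return [line for line in lines if PATTERN.search(line)]

lines = ['"Tarzan", she said.', 'Tarzan turned.', 'Jane waved.']
selected = naive_grep(lines)
# The first line is selected even though its only Tarzan is in quotes.
```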




regex101.com says it works, by returning multiple matches, only one of which has a group 1.

But I don't know which environments return multiple matches from one evaluation.


The matches must be non-overlapping, because Tarzan is contained in "Tarzan". Therefore, the input

  "Tarzan", she said.
contains two matches for the regex

  "Tarzan"|Tarzan.
The first match is at character 0, for the "Tarzan" branch of the regex. The second match is at character 1 for the Tarzan branch of the regex.
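The two offsets can be checked directly. In Python, `Pattern.match(string, pos)` attempts a match starting exactly at `pos`, which lets us probe each position independently of any scanning strategy. The variable names here are just for illustration.

```python
import re

PLAIN = re.compile(r'"Tarzan"|Tarzan')
text = '"Tarzan", she said.'

# Probe the two start positions separately.
first = PLAIN.match(text, 0)    # takes the "Tarzan" branch
second = PLAIN.match(text, 1)   # takes the Tarzan branch
spans = [first.span(), second.span()]

# A non-overlapping scan reports only the first of the two.
scanned = [m.span() for m in PLAIN.finditer(text)]
```

The second span lies entirely inside the first, which is why a scanner that skips past consumed text never reports it.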

If matches are allowed to overlap, then the inner Tarzan is matched in spite of being surrounded by quotes, and the capture register is bound as well.

This works not as a property of the regex (what it matches), but as a property of the regex combined with a scanning algorithm that extracts non-overlapping matches.



