
You're being downvoted and I agree that this is kind of overblown, but there is something here. This particular issue had nothing to do with readability specifically, but it had to do with the fact that the unpronounceable symbols ^ and $ had a specific meaning that was not what the devs expected. If we were using a more verbose pattern-matching DSL, we would probably have operators with names like "line_end" and "string_end", which don't require you to carefully cross-check the documentation in order to understand.

Personally I love regex, but only because I'm good at it and I generally have a good memory for obscure trivia.




> but it had to do with the fact that the unpronounceable symbols ^ and $ had a specific meaning that was not what the devs expected.

What's worse is that ^ and $ have different meanings depending on whether you're using "single-line" or "multi-line" mode. From a quick web search, it seems Ruby always uses "multi-line" mode, while most other languages use "single-line" mode by default and have a flag to switch to "multi-line" mode. Someone who learned regex in other languages might not notice this difference, since most of the time the text being matched has no newlines. They would then expect ^ and $ to match the boundaries of the whole text unless told otherwise by a "multi-line" flag.
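
To make the difference concrete, here is a minimal Python sketch. The input string is made up for illustration. Python's `re` defaults to the "single-line" anchor behavior, and `re.MULTILINE` switches ^ and $ to per-line matching, which is roughly what Ruby does all the time.

```python
import re

# Hypothetical input: a harmless first line followed by an injected second line.
text = "safe\n<script>"

# Default mode: ^ and $ anchor to the whole string, so the embedded
# newline and the second line prevent a match.
default_match = re.search(r"^[a-z]+$", text)

# MULTILINE mode: ^ and $ anchor to each line, so the first line
# matches on its own and the second line is never inspected.
multiline_match = re.search(r"^[a-z]+$", text, re.MULTILINE)

print(default_match)            # None
print(multiline_match.group())  # safe
```

A validator written with the default behavior in mind rejects this input, while the same pattern in per-line mode accepts it.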


I like regex too, but only for use in interactive contexts where you can verify the results (editors, search engines, etc). It's quite like Bash in that regard. Good for when you want to get a lot done without a lot of typing and you don't care if it only works on the input you have in front of you. A terrible idea everywhere else.

I also agree that more verbose syntax would help a lot. I've seen quite a few attempts to do that recently (e.g. the project formerly known as Rulex).


Personally I use https://regex101.com to test and validate any nontrivial regex, and then I actually put a permalink to the "saved regex" in a comment in the code, so any future viewer (including myself) can review it. I also occasionally put patterns into their own standalone objects or functions (depending on the language), which allows you to test them right in your test suite.

I also make extensive use of the "verbose mode" in Python. Adapted from the example in https://docs.python.org/3/howto/regex.html, compare this:

    pattern = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")
with this attempt to clean it up:

    pattern = re.compile(
        r"^\s*"
        r"&#("
        r"0[0-7]+"
        r"|[0-9]+"
        r"|x[0-9a-fA-F]+"
        r")\s*;\s*$"
    )
to this:

    pattern = re.compile(r"""
      ^\s*
      &[#]                 # Start of a numeric entity reference
        (
            0[0-7]+        # Octal form
          | [0-9]+         # Decimal form
          | x[0-9a-fA-F]+  # Hexadecimal form
        )
      \s*;                 # Trailing semicolon
      \s*$
    """,
    re.VERBOSE)
It's still not ideal, but for me it's a good balance between terseness (greater information density) and readability.
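
If you want some reassurance that a verbose rewrite didn't change behavior, a quick check along these lines can help. This is only a sketch: the sample inputs are made up, and agreeing on a handful of strings is evidence, not proof, that the two patterns are equivalent.

```python
import re

# The compact form of the pattern.
compact = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")

# The verbose form: whitespace and "#" comments are ignored by the engine,
# which is why the literal "#" has to be written as [#].
verbose = re.compile(r"""
  ^\s*
  &[#]                 # Start of a numeric entity reference
    (
        0[0-7]+        # Octal form
      | [0-9]+         # Decimal form
      | x[0-9a-fA-F]+  # Hexadecimal form
    )
  \s*;                 # Trailing semicolon
  \s*$
""", re.VERBOSE)

# Hypothetical samples: three valid entities, then three invalid ones.
samples = ["&#65;", "  &#x41 ; ", "&#0101;", "&#;", "&65;", "&#xZZ;"]

# Pair up (compact result, verbose result) for each sample.
results = [(bool(compact.search(s)), bool(verbose.search(s))) for s in samples]

for sample, (a, b) in zip(samples, results):
    print(repr(sample), a, b)
```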

The equivalent in Pomsky (I think this is the one that was formerly Rulex? https://pomsky-lang.org/) would be very similar:

    Start [s]*
    '&#'    # Start of a numeric entity reference
    (
      # Octal form
        '0' ['0' - '7']+
      # Decimal form
      | ['0' - '9']+
      # Hexadecimal form
      | 'x' ['0' - '9' 'a' - 'f' 'A' - 'F']+
    )
    [s]* ';'    # Trailing semicolon
    [s]* End
and arguably more verbose, due to the mandatory quotation marks. Note that Pomsky actually inherits the ambiguity of "Start" and "End" that led to this security bug in the first place!

Pomsky gets you a few other advantages, e.g. compatibility and polyfills across different regex engines, but the similar syntax I think goes to show how dramatic of an improvement "verbose regex" mode can be.

Finally, you have "English-like" DSLs more akin to my original suggestion, as in ReadableRegex.jl (https://github.com/jkrumbiegel/ReadableRegex.jl). I'm not sure how you'd construct the above pattern in that DSL, but I am sure that you would trade away information density and a sense of overall structure, and gain increased clarity of each individual operation. Set your priorities accordingly.


Yeah the Pomsky one is already way better because you can easily see that &# are literal characters, not some weird regex thing you've forgotten about.

That's one of the biggest issues with regex - mixing up data and control.

But I would still expect a robust codebase to have a proper number parser if you want to parse this sort of thing.


What is regex but shorthand notation for a parser?

I agree that a good codebase should generally have its regex segregated into standalone functions with their own tests (ideally property-based tests!).
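
A rough sketch of what I mean, in Python. Everything here is hypothetical: `parse_numeric_entity` and `check_round_trip` are names I made up, the pattern is the entity regex from upthread reworked with named groups, and the seeded `random` loop is a stdlib-only stand-in for a real property-based testing library.

```python
import random
import re

# Entity pattern with named groups, so the caller can tell which
# alternative matched. No ^/$ anchors: fullmatch() below handles that.
_ENTITY = re.compile(r"""
    \s* &[#]
    (?: (?P<oct>0[0-7]+) | (?P<dec>[0-9]+) | x(?P<hex>[0-9a-fA-F]+) )
    \s* ; \s*
""", re.VERBOSE)


def parse_numeric_entity(text):
    """Return the number named by a numeric entity, or None if invalid."""
    m = _ENTITY.fullmatch(text)
    if m is None:
        return None
    if m["oct"] is not None:
        return int(m["oct"], 8)
    if m["dec"] is not None:
        return int(m["dec"], 10)
    return int(m["hex"], 16)


def check_round_trip(trials=1000, seed=0):
    """Property: formatting n as decimal or hex, then parsing, returns n."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(0x110000)
        assert parse_numeric_entity(f"&#{n};") == n
        assert parse_numeric_entity(f"  &#x{n:X} ;\n") == n
    return True


print(parse_numeric_entity("&#x41;"))           # 65
print(parse_numeric_entity("&#65;\n<script>"))  # None
print(check_round_trip())                       # True
```

Using `fullmatch` instead of ^ and $ also sidesteps the anchor question entirely, since the whole input has to be consumed by the pattern.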


A better practice in Ruby is to use the \A and \z anchors for the beginning and end of the string; ^ and $ match the beginning and end of each line in Ruby, as far as I know.
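
For comparison, a small Python sketch of the same idea. Note the spelling differs: in Python's `re`, the end-of-string anchor is `\Z`, and `re.fullmatch` avoids anchors altogether. The input string is made up for illustration.

```python
import re

# Hypothetical input with a trailing newline.
trailing_newline = "123\n"

# Even in Python's default mode, $ also matches just before a final "\n".
dollar = re.search(r"^\d+$", trailing_newline)

# \A and \Z anchor strictly to the start and end of the string.
strict = re.search(r"\A\d+\Z", trailing_newline)

# fullmatch requires the pattern to consume the entire string.
full = re.fullmatch(r"\d+", trailing_newline)

print(dollar.group())  # 123
print(strict)          # None
print(full)            # None
```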



