It might be the intended behavior of Hyperscan but it really feels like a bug in...

burntsushi · 2023-12-31T14:37:19.000000Z

I don't disagree. It's why I brought this up. It's tricky to use Hyperscan, as-is, as a regex engine in a grep tool for these reasons. I don't mean to claim it is impossible, but there are non-trivial issues you'll need to solve.

It's hard to learn too much from hypergrep. It still has some rough spots:

    $ hgrep -o 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M0 -o 'foo.*bar' foobarbar.txt
    Too few arguments

    For more information try --help
    $ hgrep -M 0 -o 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M 0 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M0 'foo.*bar' foobarbar.txt
    terminate called after throwing an instance of 'std::invalid_argument'
      what():  pattern not found
    zsh: IOT instruction (core dumped)  hgrep -M0 'foo.*bar' foobarbar.txt

Another issue with Hyperscan is that if you enable HS_FLAG_UTF8[1], which hypergrep does[2,3], and then search invalid UTF-8, the result is UB. From the Hyperscan docs:

> This flag instructs Hyperscan to treat the pattern as a sequence of UTF-8 characters. The results of scanning invalid UTF-8 sequences with a Hyperscan library that has been compiled with one or more patterns using this flag are undefined.

That's another issue you'll need to grapple with if you use Hyperscan. PCRE2 used to have this issue[4], but they've since defined the semantics of searching invalid UTF-8 with Unicode mode enabled. ripgrep 14 uses that new mode, but I haven't updated that FAQ answer yet.

Hyperscan isn't alone. Many regex engines do not support searching arbitrary byte sequences[5]. This is why many, if not most, regex engines are awkward to use in a fast grep implementation: you really do not want your grep to fall over when it comes across invalid UTF-8. And the overhead of doing UTF-8 checking in the first place (which would perhaps let you just skip over lines that contain invalid UTF-8) would make it difficult to be competitive in performance. It also inhibits its usage in OSINT work.
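
To make the trade-off concrete, here is a rough sketch. I'm using Python's `re` module purely because it accepts bytes patterns; this is not how ripgrep, hypergrep or Hyperscan work, and the function names and sample haystack are made up for illustration. The first function searches raw bytes directly. The second does the "validate first, skip invalid lines" workaround, which costs an extra pass over every line and loses matches on lines that aren't valid UTF-8.

```python
import re

# Two lines; the first contains 0xFF, a byte that never appears in valid UTF-8.
haystack = b"foo\xffbar\nfoobar\n"

# A bytes pattern: matching happens on raw bytes, with no decoding step.
pattern = re.compile(rb"foo.*bar")


def search_bytes(data: bytes) -> list[bytes]:
    """Search every line as raw bytes, whether or not it is valid UTF-8."""
    return [line for line in data.splitlines() if pattern.search(line)]


def search_valid_only(data: bytes) -> list[bytes]:
    """Validate each line as UTF-8 first and skip the lines that fail."""
    hits = []
    for line in data.splitlines():
        try:
            line.decode("utf-8")  # extra pass over the line, only to validate it
        except UnicodeDecodeError:
            continue  # any match on this line is silently dropped
        if pattern.search(line):
            hits.append(line)
    return hits


print(search_bytes(haystack))
print(search_valid_only(haystack))
```

The byte-oriented search reports both lines, while the validating version only reports the second one. A real grep would want the first behavior without paying for the second's validation pass.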

[1]: https://intel.github.io/hyperscan/dev-reference/api_files.ht...

[2]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

[3]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

[4]: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#why...

[5]: https://github.com/BurntSushi/rebar/blob/96c6779b7e1cdd850b8...