Ugrep – a more powerful, fast, user-friendly, compatible grep

graphe · 2023-12-30T21:47:32.000000Z

Here's a thread on performance vs rg (ripgrep). https://github.com/BurntSushi/ripgrep/discussions/2597 didn't know about hypergrep either.

ashvardanian · 2023-12-31T00:48:27.000000Z

Haven't benchmarked *grep implementations, but assuming those are just CLI wrappers around RegEx libraries, I'd expect the RegEx benchmarks to be broader and more representative.

There, hyperscan is generally the king, which means hypergrep numbers are likely accurate: https://github.com/p-ranav/hypergrep?tab=readme-ov-file#dire...

Disclaimer: I rarely use any *grep utilities, but often implement string libraries.

burntsushi · 2023-12-31T00:58:40.000000Z

I'm the author of ripgrep and its regex engine.

Your claim is true to a first approximation. But greps are line oriented, and that means there are optimizations that can be done that are hard to do in a general regex library. You can read more about that here: https://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep (greps are more than simple CLI wrappers around a regex engine).

If you read my commentary in the ripgrep discussion above, you'll note that it isn't just about the benchmarks themselves being accurate, but the model they represent. Nevertheless, I linked the hypergrep benchmarks not because of Hyperscan, but because they were done by someone who isn't the author of either ripgrep or ugrep.

As for regex benchmarks, you'll want to check out rebar: https://github.com/BurntSushi/rebar

You can see my full thoughts around benchmark design and philosophy if you read the rebar documentation. Be warned though, you'll need some time.

There is a fork of ripgrep with Hyperscan support: https://sr.ht/~pierrenn/ripgrep/

Hyperscan also has some preculiarities on how it reports matches. You won't notice it in basic usage, but it will appear when using something like the -o/--only-matching flag. For example, Hyperscan will report matches of a, b and c for the regex \w+, where as a normal grep will just report a match of abc. (And this makes sense given the design and motivation for Hyperscan.) Hypergrep goes to some pain to paper over this, but IIRC the logic is not fully correct. I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.

haberman · 2023-12-31T02:46:28.000000Z

> I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.

From some searching I think you might mean this: https://www.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...

burntsushi · 2023-12-31T02:50:17.000000Z

Ah yup! I just posted a follow-up that links to that with an example (from a build of hypergrep off of latest master): https://news.ycombinator.com/item?id=38821321

burntsushi · 2023-12-31T02:48:58.000000Z

OK, now that I have hands on a keyboard, this is what I meant by Hyperscan's match semantics being "peculiar":

    $ echo 'foobar' | hg -o '\w{3}'
    1:foobar
    $ echo 'foobar' | grep -E -n -o '\w{3}'
    1:foo
    1:bar

Here's the aforementioned reddit thread: https://old.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...

I want to be clear that these are intended semantics as part of Hyperscan. It's not a bug with Hyperscan. But it is something you'll need to figure out how to deal with (whether that's papering over it somehow, although I'm not sure that's possible, or documenting it as a difference) if you're building a grep around Hyperscan.

jiripospisil · 2023-12-31T14:15:44.000000Z

It might be the intended behavior of Hyperscan but it really feels like a bug in Hypergrep to report the matches like this - you cannot report a match which doesn't fully match the regex...

I also wonder if there's a performance issue when matching a really long line because Hyperscan is not greedy and will ping back to Hypergrep for every sub match. I guessing this is the reason for those shenanigans in the callback [0].

  $ python -c 'print("foo" + "bar" * 3000)' | hg -o 'foo.*bar'

[0] https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

burntsushi · 2023-12-31T14:37:19.000000Z

I don't disagree. It's why I brought this up. It's tricky to use Hyperscan, as-is, as a regex engine in a grep tool for these reasons. I don't mean to claim it is impossible, but there are non-trivial issues you'll need to solve.

It's hard to learn too much from hypergrep. It still has some rough spots:

    $ hgrep -o 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M0 -o 'foo.*bar' foobarbar.txt
    Too few arguments

    For more information try --help
    $ hgrep -M 0 -o 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M 0 'foo.*bar' foobarbar.txt
    foobarbar.txt
    1:[Omitted long line with 1 matches]

    $ hgrep -M0 'foo.*bar' foobarbar.txt
    terminate called after throwing an instance of 'std::invalid_argument'
      what():  pattern not found
    zsh: IOT instruction (core dumped)  hgrep -M0 'foo.*bar' foobarbar.txt

Another issue with Hyperscan is that if you enable HS_FLAG_UTF8[1], which hypergrep does[2,3], and then search invalid UTF-8, then the result is UB.

> This flag instructs Hyperscan to treat the pattern as a sequence of UTF-8 characters. The results of scanning invalid UTF-8 sequences with a Hyperscan library that has been compiled with one or more patterns using this flag are undefined.

That's another issue you'll need to grapple with if you use Hyperscan. PCRE2 used to have this issue[4], but they've since defined the semantics of searching invalid UTF-8 with Unicode mode enabled. ripgrep 14 uses that new mode, but I haven't updated that FAQ answer yet.

Hyperscan isn't alone. Many regex engines do not support searching arbitrary byte sequences[5]. And this is why many/most regex engines are awkward to use in a fast grep implementation. Because you really do not want your grep to fall over when it comes across invalid UTF-8. And the overhead of doing UTF-8 checking in the first place (and perhaps let you just skip over lines that contain invalid UTF-8) would make it difficult to be competitive in performance. It also inhibits its usage in OSINT work.

[1]: https://intel.github.io/hyperscan/dev-reference/api_files.ht...

[2]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

[3]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

[4]: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#why...

[5]: https://github.com/BurntSushi/rebar/blob/96c6779b7e1cdd850b8...

kazinator · 2023-12-31T05:41:40.000000Z

How about: use Hyperscan to round up all the lines that contain matches, and process those again with regex for the "-o" semantics.

burntsushi · 2023-12-31T12:30:58.000000Z

You mean two different regex engines for the same search? That is perhaps conceptually fine, but in practice any two regex engines are likely to have differences that will make that strategy fall apart in some cases. (Perhaps unless those regex engines rigorously stick to a spec like POSIX or ecmascript. But that's not the case here. IIRC Hyperscan meticulously matches the behavior of a subset of PCRE2, but ripgrep's default engine is not PCRE2.)

You could perhaps work around this by only applying it as an optimization when you know the pattern has identical semantics in both regex engines. But you would have to do the work to characterize them.

I would rather just make the regex crate faster. If you look at the rebar benchmarks, it's not that far behind and is sometimes even faster. The case where Hyperscan really destroys everything else is for searches for many patterns.

Hyperscan has other logistical issues. It is a beast to build. And its pattern compilation times can be large (again, see rebar). Hyperscan itself only supports x86-64, so one would probably want to actually use Vectorscan (a fork of Hyperscan that supports additional architectures).

cozzyd · 2023-12-31T06:23:01.000000Z

is that an alias, or does hypergrep really use the same command name as mercurial?

tedunangst · 2023-12-31T06:29:11.000000Z

It was renamed: https://github.com/p-ranav/hypergrep/commit/ee85b713aa84e005...

infocollector · 2023-12-31T00:02:03.000000Z

I think you should try it before you read these conflicting benchmarks from the authors: https://github.com/Genivia/ugrep-benchmarks

1vuio0pswjnm7 · 2023-12-31T08:41:12.000000Z

rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.

Would be interesting to see comparisons where memory is limited, i.e., where the file being searched will not fit entirely into memory.

Personally I'm interested in "grep -o" alternatives. The files I'm searching are text but may have few newlines. For example I use ired instead of grep -o. ired will give the offsets of all matches, e.g.,

      echo /\"something\"|ired -n 1.htm

Quick and dirty script, not perfect:

      #!/bin/sh
      test $# -gt 0||echo "usage: echo string|${0##*/} file [blocksize] [seek] [match-no]"
      {
      read x;
      x=$(echo /\""$x"\"|ired -n $1|sed -n ${4-1}p); 
      test "$x"||exit 1;
      echo
      printf s"$x"'\n's-${3-0}'\n'x$2'\n'|ired -n $1;
      echo;
      printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1;
      echo;
      echo w$(printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1)|ired -n /dev/stdout;
      echo;
      }

Another script I use loops through all the matches.

burntsushi · 2023-12-31T12:39:54.000000Z

> rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.

Which test exactly? That's just likely because of memory maps futzing with the RSS data. Not actually more heap memory. Try with --no-mmap.

I'm not sure I understand the rest of your comment about grep -o. Grep tools usually have a flag to print the offset of each match.

EDIT: Now that I have hands on a keyboard, I'll demonstrate the mmap thing. First, ugrep:

    $ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt
    72

    real    22.115
    user    22.015
    sys     0.093
    maxmem  30 MB
    faults  0
    $ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt --mmap
    72

    real    21.776
    user    21.749
    sys     0.020
    maxmem  802 MB
    faults  0

And now for ripgrep:

    $ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt
    72

    real    0.076
    user    0.046
    sys     0.030
    maxmem  779 MB
    faults  0
    $ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt --no-mmap
    72

    real    0.087
    user    0.033
    sys     0.053
    maxmem  15 MB
    faults  0

It looks like the difference here is that ripgrep chooses to use a memory map by default here. I don't think it makes much of a difference here.

If the file were bigger than available memory, then the OS would automatically handle paging.

1vuio0pswjnm7 · 2024-01-11T02:43:49.000000Z

ripgrep is not for me.

burntsushi · 2024-01-11T03:19:59.000000Z

I never argued otherwise. Especially since you clearly don't mind false negatives. ;-)

1vuio0pswjnm7 · 2024-01-08T21:24:17.000000Z

task: printing non-repeating patterns in relatively small files to the screen, optionally with some context

context should be printed exactly as it appears in the file, i.e., newlines should be printed

ired vs ripgrep, which one is better suited for this task

one uses regular expressions, the other does not

one is a 76k static binary that fits in 2MB L2 cache, the other is a 5.7MB dynamically-linked binary

2 shell scripts to demonstrate differences

usage: echo pattern|1.sh [num chars before] [num chars after]

1. "1.sh" using 5.7MB binary, PCRE2

      #!/bin/sh
      read x;
      case $# in :)
      ;;0)exec echo "usage: ${0##*/} file [num chars before] [num chars after]"
      ;;1)exec rg -uuu -no-unicode --block-buffered --color=never -NUo "$x" $1
      esac
      case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
      |rg -f/dev/stdin -uuu --no-unicode --block-buffered --color=never -NUo $1

2. "2.sh" using 76k static binary, no regular expressions

      #!/bin/sh
      read x;
      len=${#x};
      b=$(($2+$len));
      case $# in 0)exec echo "usage ${0##*/} file [num chars before] [num chars after]"
      ;;2)b=$(($2+$len))
      ;;3)b=$(($3+$2+$len))
      esac
      echo "$x" > .x
      { printf /;ired -n -c X1 .x;} \
      |ired -n $1 \
      |sed  "s/.*/s&@s-${2-0}@b$b@X/;" \
      |tr @ '\12' \
      |ired -n $1 \
      |sed 's/.*/w&0a/' \
      |ired -n /dev/stdout \
      |sed -e '/^Invalid hexpair/d'

generate test data:

      curl -4si0 -A "" https://www.google.com > test.html

usage example: find the pattern "(" in test.html, display results to screen

      echo \(|1.sh test.html
 
      regex parse error:
          (?:()
          ^
      error: unclosed group
 
      echo '[(]'|1.sh test.html
 
      echo \(|2.sh test.html
 
      cat .x

observation: the task is simple but 1.sh may require more typing and knowledge of regular expressions

observation: 2.sh does not require knowledge of PCRE; the pattern requires no extra chars, e.g., brackets

usage example: find the pattern "(a" in test.html, display results to screen with 0 chars before and 3 chars after

     echo '[(]a'|1.sh test.html 0 3|sed -n l|less -N

     echo \(a|2.sh test.html 0 3|sed -n l|less -N

observation: 1.sh does not include the newline after match #187; some workaround is required for 1.sh

conclusion: for me, ripgrep is too large and complicated for this simple task involving relatively small files; it's overkill. it does not feel any faster than ired at the command line. in fact, it feels slower. like python or java, or other large rust/go binaries, there is a small initial delay, a jank. whereas ired feels very smooth.

burntsushi · 2024-01-08T22:10:18.000000Z

I love how you continue to ignore the fact that ired produces incorrect results.

Also:

You can use -F to make the argument to ripgrep be interpreted as a literal. No knowledge of regex is needed. It's a standard grep flag.

You also aren't using PCRE. You're using ripgrep's default engine, which is the regex crate. You need to pass -P to use PCRE2. Although I don't see the point in doing so.

I find your overall comparison here to be disengenuous personally. You can't even be arsed to acknowledge that ired returns incorrect results. And every benchmark I've run has shown ripgrep to be faster or just as fast. There's no jank.

I already acknowledged that the rg binary is beefy. It is actually statically linked by default (although it may dynamically link C libraries). I don't care if rg is 5MB. If you do, then rg isn't for you. You can keep using broken software instead.

1vuio0pswjnm7 · 2024-01-09T05:58:31.000000Z

   xbps-query -RS ripgrep |sed -n 11,21p

      pkgname: ripgrep
      pkgver: ripgrep-14.0.3_1
      repository: https://repo-default.voidlinux.org/current/musl
      run_depends:
      libgcc>=4.4.0_1
      libpcre2>=10.22_1
      musl>=1.1.24_7
      shlib-requires:
      libc.so
      libgcc_s.so.1
      libpcre2-8.so.0

It would be nice to have a ripgrep without libpcre2.

It also would be nice to use BRE by default and make ERE optional, similar to grep.

What would compiling ripgrep from source entail. Would it be as easy as compiling ired.

ired compiles in seconds and compiling requires less than 1MB of disk space. No connection to any server is required to compile the program.

Let's edit the 1.sh script to add the -F option and try our example search again to see what happens.

       #!/bin/sh
       read x;
       case $# in :)
       ;;0)exec echo "usage: ${0##*/} file [chars before] [chars after]"
       ;;1)exec rg -F --no-unicode --block-buffered --color=never -NUo "$x" $1
       esac
       case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
       |rg -f/dev/stdin -F --no-unicode --block-buffered --color=never -NUo $1

       echo \(|1.sh test.html

       echo \(a|1.sh test.html 0 3

As expected, this produces no output.

We cannot add the surrounding context characters as literals because we do not know the identity of these characters. That is what we are attempting to find out.

Would I ever search for a repeating pattern such as \(a\(a using ired. The answer is no; I am looking for context. I would search for \(a and then add a request for context, a number characters before and/or after, as in the examples. Again, I do not know what those characters will be; that is what I am searching for. If the pattern repeats, this would be visible from viewing the context.

For line-delimited files where data is presented in a regular format, grep -A -B and -C work great for printing context. But for files that can be idiosyncratic in how they present data and/or files that lack consistent newline delimeters, for me, grep -o is inadequate for printing context.

burntsushi · 2024-01-09T06:24:01.000000Z

> It would be nice to have a ripgrep without libpcre2.

Blame your packagers for that, not me. PCRE2 is an optional dependency that isn't enabled by default. If you bothered to read the README, you'd know that.

Your packagers probably enable PCRE2 because nobody gives a fuck about a MB or two. And if you do, compile your own damn software.

> It also would be nice to use BRE by default and make ERE optional, similar to grep.

Disagree. No thanks.

> What would compiling ripgrep from source entail. Would it be as easy as compiling ired.

cargo build --release

Read the docs maybe?

> ired compiles in seconds and compiling requires less than 1MB of disk space. No connection to any server is required to compile the program.

I don't care. Go talk to the Debian people. They know how to build Rust software without connecting to the Internet. Cargo supports that workflow, but it's not the default.

ripgrep isn't minimalist software. Never was. Never will be. If that's a requirement, then tough cookies.

> The answer is no; I am looking for context.

And who knows if you'll find it with ired given that it doesn't implement substring search properly. I demonstrated this with a simple example days ago. You continue to ignore it. This is why your commentary is dishonest. You keep looking for other reasons to prop up ired. Shifting goalposts. And not even bothering to acknowledge that I already told you days ago that ripgrep and ired are two different tools. They don't target the same use cases.

Go back to the start of our conversation. All I did was correct an erroneous complaint about perf where you snubbed your nose. Since then, you've hemmed and hawed and tried to turn this into a broader discussion about whether ripgrep can replace ired. I never said it could. rg -o might be able to hit some subset of ired faster than ired, but I don't claim anything beyond that. I mean obviously ripgrep works better on line oriented data. It is a grep!

Your tunnel vision is so intense you can't even realize that you're trying to rationalize the use of broken software. Who the fuck cares if it's fast or let's you show windowed context if it can't even report every match?

Fast and user friendly doesn't mean shit if it's wrong. The fact that you can't acknowledge that reveals the dishonesty in your commentary.

1vuio0pswjnm7 · 2024-01-11T04:48:35.000000Z

The failure of grep/ripgrep to display the newline character contained in the context in match #178 could be characterised as a "false negative".

burntsushi · 2024-01-11T12:45:36.000000Z

Wrong again. The . doesn't match newlines by default. That's standard for regex. As with just about any regex engine, you need to explicitly instruct . to match newlines. In ripgrep, that means `(?s)..query..` for example. Or pass `--multiline-dotall`. Or use `\p{any}` to match the UTF-8 encoding of any Unicode scalar value. Or `(?-u:[\x00-\xFF])` to match any byte value.

In contrast, not being able to find ABAB in ABAABAB is definitely a false negative. There's no qualification needed there. You're using badly broken software. One wonders why you can't seem to bring yourself to acknowledge it as a downside.

1vuio0pswjnm7 · 2024-01-14T03:33:55.000000Z

You have failed to provide a working example that does the simple task I presented, which ired does perfectly. That's all I am looking for. A working example. You also claimed you would provide an explanation of why traditional grep -o is so slow when adding preceeding context, e.g., .{500}pattern. You never did. You decided to start whining incessantly about not being able to match substrings instead.

Again, I am not searching for repeating strings such as ABAB. I am searching for AB preceded or followed by unknown characters. Thus the "problem" you found is not one I am going to have given the way I'm using the program. Why would I care about it. It's a hex editor, not a program I use to do searches. I never searched for a repeating pattern. You did. Then you proceed to whine about it. Incessantly. Hoping to create some sort of diversion.

Further, if you were really paying attention, you would notice that in the scripts I presented I'm not using ired to search for strings (/"string"). I am searching multiple, catenated hex pairs (/737472696e67). ired is not intended to be used that way. Although it works for my purposes, ired is only intended to be used to search a single hex pair. The issue you spotted is applicable when searching _strings_, not searching a single hex pair. But I am not searching strings. I'm searching multiple hex pairs. And because the program is not intended to be used to match catenated hex pairs, I fully expected this might not work. As it happens, it works.

Unless an example is provided, it appears that using rg -o to _print_ (not just match) characters found trailing context that happen to be newlines (task #1) works about as well as using ired to search for repeating strings in a file (task #2). It does not work. This is not surprising to me since IMO there are better programs to do those tasks. As stated in the very first comment I made, I am interesed in programs that perform task #1. I do not need solutions for task #2. Despite what you may be selling.

You cannot produce a working example.

burntsushi · 2024-01-14T04:39:28.000000Z

I already told you how to make a dot match a newline.

I explained why repeating patterns can be expensive for regex engines to execute. It's a known issue among most regex engines.

I find it funny that you whine and dance around all your little use cases, but when it comes to actual reporting correct output, you just happen to conveniently assume that none of the strings you search for produce incorrect results.

As I already said, this entire conversation started by me correcting a perf claim where you snubbed your nose. That was my only point. You then decided to talk about a grander problem that I have never engaged in. I never once said rg meets your use cases. (Your use cases seem quite strange.) I don't give a shit if you think rg can replace broken software or not.

1vuio0pswjnm7 · 2024-01-02T23:52:06.000000Z

1. Retrieve test.json

     curl -i40A "" "https://api.crossref.org/works?query=unix&rows=1000" > test.json

2. Create shell script

      #!/bin/sh
      # usage: echo string|1.sh file [blocksize] [seek]"
      read x;
      x=$(echo -n $x|od -An -tx1|tr -d '\40');
      echo /$x \
      |ired -n $1 \
      |sed "s/.*/s&@s-${3-0}@X$2/" \
      |tr @ '\12' \
      |ired -q -i /dev/stdin $1 \
      |sed 's/.*/w&0a/' \
      |ired -n /dev/stdout

We can make the script slightly faster by using busybox

      #!/bin/sh
      # usage: echo string|1.sh file [blocksize] [seek]"
      read x;
      x=$(echo -n $x|busybox od -An -tx1|busybox tr -d '\40');
      echo /$x \
      |ired -n $1 \
      |busybox sed "s/.*/s&@s-${3-0}@X$2/" \
      |busybox tr @ '\12' \
      |ired -q -i /dev/stdin $1 \
      |busybox sed 's/.*/w&0a/' \
      |ired -n /dev/stdout

NB. If redirecting output to a file, replace /dev/stdout with the file name.

ired is available on Void Linux

https://ftp.lysator.liu.se/pub/voidlinux/static/

      xbps-query.static -Rs ired-0
      xbps-install.static ired

3. Test grep v3.6, ripgrep v14.0.3 and shell script; busybox is v1.34.1

     busybox time grep -Eo .{35}https:.{4} test.json;

     busybox time rg -o .{35}https:.{4} test.json;

     busybox time sh -c "echo https:|1.sh 45 35 test.json"

We can make the script slower by using bash

     busybox time bash -c "echo https:|1.sh 45 35 test.json"

Program size

     du -h /usr/bin/grep
     216K/usr/bin/grep

     du -h /usr/bin/rg
     5.7M/usr/bin/rg

     du -hc /usr/bin/ired /bin/dash /usr/bin/tr /usr/bin/sed /usr/bin/od
     456K/bin/dash
     40K/usr/bin/ired
     56K/usr/bin/tr
     68K/usr/bin/od
     104K/usr/bin/sed
     724Ktotal

     du -h /usr/bin/busybox /usr/bin/ired
     772K/usr/bin/busybox
     40K/usr/bin/ired
     812Ktotal

     readelf -d /bin/dash /usr/bin/busybox

     File: /bin/dash

     There is no dynamic section in this file.

     File: /usr/bin/busybox

     There is no dynamic section in this file.

burntsushi · 2024-01-03T16:17:57.000000Z

OK, so I'll try your commands:

    $ busybox time grep -Eo .{35}https:.{4} test.json
    real    0m 0.15s
    user    0m 0.15s
    sys     0m 0.00s

    $ busybox time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0m 0.00s
    user    0m 0.00s
    sys     0m 0.00s

    $ busybox time dash -c "echo https:|./1.sh test.json 45 35"
    real    0m 0.01s
    user    0m 0.01s
    sys     0m 0.00s

    $ busybox time bash -c "echo https:|./1.sh test.json 45 35"
    real    0m 0.00s
    user    0m 0.00s
    sys     0m 0.00s

    $ busybox time dash -c "echo https:|./busy-1.sh test.json 45 35"
    real    0m 0.00s
    user    0m 0.01s
    sys     0m 0.00s

    $ busybox time bash -c "echo https:|./busy-1.sh test.json 45 35"
    real    0m 0.01s
    user    0m 0.01s
    sys     0m 0.00s

So grep -o takes 150ms, but both ripgrep and ired are seemingly instant. But if I use zsh's builtin `time` command with my own TIMEFMT[1], it gives me numbers greater than 0:

    $ time grep -Eo .{35}https:.{4} test.json
    real    0.324
    user    0.317
    sys     0.007
    maxmem  16 MB
    faults  0

    $ time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0.008
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 45 35"
    real    0.010
    user    0.011
    sys     0.007
    maxmem  16 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 45 35"
    real    0.011
    user    0.014
    sys     0.004
    maxmem  16 MB
    faults  0

Would you look at that. ripgrep is faster! By a whole 2 milliseconds! WOW!

OK, since I'm a software developer and thus apparently cannot understand the lowly needs of an "ordinary user," I'll hop over to my machine with a i5-7600, which was released 6 years ago. Is that ordinary enough, or still too super charged to do any meaningful comparison whatsoever?

    $ time grep -Eo .{35}https:.{4} test.json
    real    0.641
    user    0.620
    sys     0.017
    maxmem  6 MB
    faults  0

    $ time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0.010
    user    0.008
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 45 35"
    real    0.011
    user    0.009
    sys     0.011
    maxmem  6 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 45 35"
    real    0.013
    user    0.021
    sys     0.003
    maxmem  6 MB
    faults  0

(I ran the commands above each several times and took the minimum.)

OK, so ripgrep is still 1ms faster even on "ordinary user" hardware.

All right, so your other comment also shared another benchmark:

    $ time grep -Eo .{100}https:.{50} test.json
    real    1.777
    user    1.772
    sys     0.003
    maxmem  6 MB
    faults  0

    $ time rg-14.0.3 -o .{100}https:.{50} test.json
    real    0.013
    user    0.006
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time rg-14.0.3 --color never -o .{100}https:.{50} test.json
    real    0.006
    user    0.006
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 156 100"
    real    0.015
    user    0.024
    sys     0.004
    maxmem  7 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 156 100"
    real    0.016
    user    0.028
    sys     0.000
    maxmem  7 MB
    faults  0

(Notice that disabling color and line numbers for ripgrep improves its speed a fair bit. ired isn't doing either of those things, so it's only fair. GNU grep doesn't count line numbers by default and disabling color doesn't improve its perf here.)

This one is more interesting because it exposes the fact that many regex engines have trouble dealing with bounded repeats. Something like `.{100}` for example is not executed particularly efficiently in most regex engines. And indeed, in ripgrep by default, `.` actually matches the UTF-8 encoding of any Unicode scalar value (so between 1 and 4 bytes) and not any arbitrary byte. You'd need to pass the `--no-unicode` flag or prefix your pattern with `(?-u)` to match any arbitrary byte. And indeed, even then, `.` doesn't match `\n`. So you might even want `(?s-u)`. But since this is a grep and *greps are line oriented*, you'd need to enable multi-line mode in ripgrep (GNU grep doesn't have this):

    $ time rg-14.0.3 -Uo '(?s-u).{100}https:.{50}' test.json
    real    0.057
    user    0.041
    sys     0.006
    maxmem  8 MB
    faults  0

    $ time rg-14.0.3 --color never -N -Uo '(?s-u).{100}https:.{50}' test.json
    real    0.042
    user    0.041
    sys     0.000
    maxmem  8 MB
    faults  0

This actually runs slower, I believe, because it disables the line oriented optimizations that ripgrep uses. In this case, it isn't as good at detecting the `https:` literal and looking for that first. That's where `ired` can do (a lot) better, because it isn't line oriented and doesn't need to support arbitrary regex patterns. greps are.

To complete this analysis, I'm going to do something that I realize is blasphemous to you and increase the input size by ten-fold. This will help us understand where time is being spent:

    $ time grep --color=never -Eo .{100}https:.{50} test.10x.json
    real    17.931
    user    17.906
    sys     0.017
    maxmem  7 MB
    faults  0

    $ time rg-14.0.3 --color never -N -o '.{100}https:.{50}' test.10x.json
    real    0.032
    user    0.017
    sys     0.010
    maxmem  23 MB
    faults  0

    $ time rg-14.0.3 --color always -N -o '.{100}https:.{50}' test.10x.json
    real    0.137
    user    0.034
    sys     0.019
    maxmem  23 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.10x.json 156 100"
    real    0.067
    user    0.089
    sys     0.069
    maxmem  7 MB
    faults  0

I compared the profiles of `rg --color=never` and `rg --color=always`, and they look about the same to me. This suggests to me that color is slower simply because rendering it in my terminal emulator is slower.

For grins, I also tried ugrep:

    $ time ugrep-4.4.1 --color=never -o '.{100}https:.{50}' test.10x.json
    real    6.003
    user    5.977
    sys     0.007
    maxmem  6 MB
    faults  0

Owch. But not as bad as GNU grep.

So with a bigger input, we can see that `rg -o` is about twice as fast as ired, even on "ordinary" hardware.

And IMO, for inputs of the size you've provided, the difference is not meaningful.

Going back to your original prompt:

> Personally I'm interested in "grep -o" alternatives.

It seems to me like `rg -o` is quite serviceable in that regard, and at the very least, substantially better than GNU grep.

At this point, I wondered what ired did for substring search[2]. That immediately stuck out to me as something that looked wrong. Indeed:

    $ cat haystack
    ABAABAB
    $ echo -n BAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
    0x4
    $ echo -n ABAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
    $ rg -o ABAB haystack
    1:ABAB

So ired is a toy. One wonders how many search results you've missed over the years because of ired's feature "it's so minimal that it's wrong!" I mean sometimes tools have bugs. ripgrep has had bugs too. But this one has been in ired since 2009.

What is it that you said? YIKES. Yeah. Seems appropriate.

[1]: https://github.com/BurntSushi/dotfiles/blob/eace294fd80bfde1...

[2]: https://github.com/radare/ired/blob/a1fa7904e6ad239dde950de5...

1vuio0pswjnm7 · 2024-01-01T03:56:53.000000Z

About grep -o.

    # stat -c %s file
    6297285

    # file file
    file: ASCII text, with very long lines (1545), with CRLF, LF line terminators

Imagine file as a wall of text.

1. Printing byte offsets.

    # time grep -ob string file

    0.03user 0.08system 0:00.22elapsed 52%CPU (0avgtext+0avgdata 1104maxresident)k
    0inputs+0outputs (0major+86minor)pagefaults 0swaps

    # rg -V
    ripgrep 13.0.0
   
    # time rg -ob string file

    0.10user 0.17system 0:01.11elapsed 25%CPU (0avgtext+0avgdata 7804maxresident)k
    0inputs+0outputs (0major+559minor)pagefaults 0swaps

    # time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n file"

    0.03user 0.09system 0:00.18elapsed 67%CPU (0avgtext+0avgdata 720maxresident)k
    0inputs+0outputs (0major+189minor)pagefaults 0swaps

2. Printing some "context" around the matched string. For example, add characters immediately preceding string.

Baseline.

     # time grep -o string file

     0.02user 0.07system 0:00.15elapsed 65%CPU (0avgtext+0avgdata 1068maxresident)k
     0inputs+0outputs (0major+84minor)pagefaults 0swaps

Add one character.

     # time grep -o .string file

     0.21user 0.08system 0:00.36elapsed 83%CPU (0avgtext+0avgdata 1088maxresident)k
     0inputs+0outputs (0major+87minor)pagefaults 0swaps

Add another character.

     # time grep -o ..string file

     0.29user 0.09system 0:00.46elapsed 82%CPU (0avgtext+0avgdata 1064maxresident)k
     0inputs+0outputs (0major+88minor)pagefaults 0swaps

     # time rg -o ..string file

     0.13user 0.13system 0:00.90elapsed 28%CPU (0avgtext+0avgdata 9012maxresident)k
     0inputs+0outputs (0major+574minor)pagefaults 0swaps

Yikes.

Now let's try ired. Another shell script. This one will print all occurences of string.

     cat > 1.sh << eof
     #!/bin/sh
     # usage: echo string|1.sh file [blocksize] [seek]"
     read x;
     x=$(echo -n $x|xxd -p);
     echo /$x \
     |ired -n $1 \
     |sed "s/.*/s&@s-${3-0}@X$2/" \
     |tr @ '\12' \
     |ired -q -i /dev/stdin $1 \
     |sed 's/.*/w&0a/' \
     |ired -n /dev/stdout
     eof

Baseline.

     # echo string|time sh 1.sh 6

     0.11user 0.10system 0:00.17elapsed 127%CPU (0avgtext+0avgdata 772maxresident)k
     0inputs+0outputs (0major+466minor)pagefaults 0swaps

Add one character before string.

     # echo string|time sh 1.sh 7 1

     0.12user 0.09system 0:00.16elapsed 131%CPU (0avgtext+0avgdata 740maxresident)k
     0inputs+0outputs (0major+473minor)pagefaults 0swaps

Add another.

     # echo string|time sh 1.sh 8 2

     0.12user 0.11system 0:00.20elapsed 112%CPU (0avgtext+0avgdata 744maxresident)k
     0inputs+0outputs (0major+461minor)pagefaults 0swaps

Perhaps grep or ripgrep might be slightly faster at printing byte offsets.

But ired is faster at printing matches with context. (NB. Context here means characters, not lines.)

Try using ripgrep to print offsets for ired.

    #!/bin/sh
    read x; 
    rg --no-mmap -ob $x $1 \
    |cut -d: -f1 \
    |sed "s/.*/s&@s-${3-0}@X$2/" \
    |tr @ '\12' \
    |ired -q -i /dev/stdin $1 \
    |sed 's/.*/w&0a/' \
    |ired -n /dev/stdout

    # time sh -c "echo string|1.sh file 8 2"

    0.11user 0.06system 0:00.18elapsed 101%CPU (0avgtext+0avgdata 5972maxresident)k
    0inputs+0outputs (0major+905minor)pagefaults 0swaps

    # stat -c %s /usr/bin/ired /usr/bin/grep /usr/bin/rg

    37544
    219248
    5074800

burntsushi · 2024-01-01T13:47:06.000000Z

OK, so first of all, let's get one thing cleared up. What the heck is ired? It isn't in the Archlinux package repos. I found this[1], but it looks like an incomplete and abandoned project. It doesn't even have proper docs:

    $ ired -h
    ired [-qhnv] [-c cmd] [-i script] [-|file ..]
    $ ired --help
    $

So like, I don't even know what `ired -n` is doing. From what I can tell from your commands, it's searching for `string`, but you first need to convert it to a hexadecimal representation.

But okay, let's also check the output between the commands and make sure they're the same. I used my own file:

    $ time grep -ob string 1-2048.txt
    333305:string
    333380:string
    920494:string
    5166701:string
    5210094:string
    6775219:string

    real    0.006
    user    0.006
    sys     0.000
    maxmem  15 MB
    faults  0

    $ time rg -ob string 1-2048.txt
    13123:333305:string
    13124:333380:string
    33382:920494:string
    159885:5166701:string
    161059:5210094:string
    211466:6775219:string

    real    0.003
    user    0.000
    sys     0.003
    maxmem  15 MB
    faults  0

    $ time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n 1-2048.txt"

    0x515f9
    0x51644
    0xe0bae
    0x4ed66d
    0x4f7fee
    0x6761b3

    real    0.013
    user    0.010
    sys     0.004
    maxmem  15 MB
    faults  0

Indeed, the hexadecimal offsets printed by ired line up with the offsets printed by grep and ripgrep. Notice also the timing. ired is slower here for me.

OK, now let's do context:

    $ time grep -ob string 1-2048.txt
    [..snip..]
    real    0.006
    user    0.006
    sys     0.000
    maxmem  16 MB
    faults  0

    $ time grep -ob .string 1-2048.txt
    [..snip..]
    real    0.005
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time grep -ob ..string 1-2048.txt
    [..snip..]
    real    0.006
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time rg -ob string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.003
    sys     0.000
    maxmem  16 MB
    faults  0
    $ time rg -ob .string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.000
    sys     0.003
    maxmem  16 MB
    faults  0
    $ time rg -ob ..string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.004
    sys     0.000
    maxmem  16 MB
    faults  0

I don't see anything worth saying "yikes" about here.

One possible explanation for the timing differences is that your search has a lot of search results. The match count is a crucial part of benchmarking, and you've made the same mistake as the ugrep author by omitting them. But okay, let me try a search with more hits.

    $ time rg -ob the 1-2048.txt | wc -l
    60509

    real    0.011
    user    0.006
    sys     0.006
    maxmem  16 MB
    faults  0

    $ time rg -ob .the 1-2048.txt | wc -l
    60477

    real    0.014
    user    0.014
    sys     0.000
    maxmem  16 MB
    faults  0

    $ time rg -ob ..the 1-2048.txt | wc -l
    60359

    real    0.014
    user    0.014
    sys     0.000
    maxmem  16 MB
    faults  0

A little slower, but that's what you'd expect with the higher match frequency. Now let's try your script for 1.sh:

    $ echo the | time sh 1.sh 1-2048.txt 6 | wc -l
    63304

    real    0.048
    user    0.072
    sys     0.052
    maxmem  16 MB
    faults  0

    $ echo the | time sh 1.sh 1-2048.txt 7 1 | wc -l
    63336

    real    0.056
    user    0.096
    sys     0.042
    maxmem  16 MB
    faults  0

    $ echo the | time sh 1.sh 1-2048.txt 8 2 | wc -l
    63419

    real    0.053
    user    0.079
    sys     0.049
    maxmem  16 MB
    faults  0

(The counts are a little different because `..the` matches fewer things than `the` when given to grep, but presumably `ired` doesn't care about that.)

But in any case, ired is quite a bit slower here.

OK, let's pop up a level. Your benchmark is somewhat flawed. For three reasons. First is because the timings are so short that the differences here are generally irrelevant to human perception. It reminds me of the time when ripgrep came out, and someone would respond with a "gotcha" that `ag` was faster because it ran a search on a tiny repository in 10ms where as ripgrep took 12ms. That's not quite exactly the same as what's happening here, but it's close. The second is that the haystack is so short that overhead is likely playing a role here. The timings are just too short to be reliable indicators of performance as the haystack size scales. See my commentary on ugrep's benchmarks[2].

Let's try a bigger file:

    $ stat -c %s eigth.txt
    1621035918

    $ file eigth.txt
    eigth.txt: ASCII text

    $ time rg -ob Sherlock eigth.txt | wc -l
    1068

    real    0.154
    user    0.103
    sys     0.050
    maxmem  1551 MB
    faults  0

    $ time rg -ob .Sherlock eigth.txt | wc -l
    935

    real    0.156
    user    0.096
    sys     0.060
    maxmem  1551 MB
    faults  0

    $ time rg -ob ..Sherlock eigth.txt | wc -l
    932

    real    0.154
    user    0.107
    sys     0.047
    maxmem  1551 MB
    faults  0

And now ired:

    $ echo Sherlock | time sh 1.sh eigth.txt 6 | wc -l
    1068

    real    1.393
    user    0.671
    sys     0.729
    maxmem  16 MB
    faults  0

    $ echo Sherlock | time sh 1.sh eigth.txt 7 1 | wc -l
    1201

    real    1.391
    user    0.604
    sys     0.793
    maxmem  16 MB
    faults  0

    $ echo Sherlock | time sh 1.sh eigth.txt 8 2 | wc -l
    1204

    real    1.395
    user    0.578
    sys     0.823
    maxmem  16 MB
    faults  0

Yikes. Over an order of magnitude slower.

Now that the memory usage reported for ripgrep is high just because it's using file backed memory maps. It's not actual heap usage. You can check this by disabling memory maps:

    $ time rg -ob ..Sherlock eigth.txt --no-mmap | wc -l
    932

    real    0.179
    user    0.063
    sys     0.116
    maxmem  16 MB
    faults  0

And if we increase the match frequency on the same large haystack, the gap closes a little, but ired is still about 4x slower:

    $ time rg -ob ..the eigth.txt | wc -l
    13141187

    real    2.470
    user    2.418
    sys     0.050
    maxmem  1551 MB
    faults  0

    $ echo the | time sh 1.sh eigth.txt 8 2 | wc -l
    13894916

    real    10.027
    user    16.293
    sys     8.122
    maxmem  402 MB
    faults  0

I'm not clear on why you're seeing the results you are. It could be because your haystack is so small that you're mostly just measuring noise. ripgrep 14 did introduce some optimizations in workloads like this by reducing match overhead, but I don't think it's anything huge in this case. (And I just tried ripgrep 13 on the same commands above and the timings are similar if a tiny bit slower.)

[1]: https://github.com/radare/ired

[2]: https://github.com/BurntSushi/ripgrep/discussions/2597

1vuio0pswjnm7 · 2024-01-03T22:05:11.000000Z

    curl -A "" https://raw.githubusercontent.com/json-iterator/test-data/0bce379832b475a6c21726ce37f971f8d849513b/large-file.json \
    |tr -d '\12' > test.json

How slow is grep -o for printing a match with surrounding characters

To observe, keep adding characters before the match

    _test1(){
    echo .{$1}'(https:)|(http:)'.{$2};
    n=0;while test $n -le 3;do
    busybox time grep -Eo ".{$1}(https:)|(http:).{$2}" test.json |sed d;
    echo;
    n=$((n+1));
    done;
    }

    _test1 5 10;
    _test1 15 10;
    _test1 25 10;
    _test1 35 10;
    _test1 45 10;
    _test1 55 10;
    _test1 105 10;

Question: Does piping rg output to wc -l affect time(1) output?

Answer: [ ] Yes [ ] No

    _test2(){
    echo .{$1}'(https:)|(http:)'.{$2};
    n=0;while test $n -le 3;do
    busybox time rg -o .{$1}'(https:)|(http:)'.{$2} test.json;
    sleep 2;
    echo;
    echo Now try this with a pipe to wc -l...;
    echo;
    sleep 2
    busybox time rg -o .{$1}'(https:)|(http:)'.{$2} test.json |wc -l >/dev/null;
    echo;
    sleep 5;
    echo;
    n=$((n+1));
    done;
    }

    _test2 150 10;

burntsushi · 2024-01-03T22:24:49.000000Z

> Does piping rg output to wc -l affect time(1) output?

Oh yes absolutely! If `rg` is printing to a tty, it will automatically enable showing line numbers and printing with colors. Both of those have costs (over and beyond just printing matches and their byte offsets) that appear irrelevant to your use case. Neither of those things are done by ired. It's not about `wc -l` specifically, but about piping into anything. And of course, with `wc -l`, you avoid the time needed to actually render the results. But I used `wc -l` with ired too, so I "normalized" the benchmarking model and simplified it.

But either way, my most recent comment before this one capitulated to your demands and avoided the use of piping results into anything. It was for this reason that I showed commands with `--color=never -N`.

And yes, `grep -Eo` gets slower with more `.`. ripgrep does too, but is a bit more robust than GNU grep. I already demonstrated this in my most recent previous comment and even wrote some words about it explicitly and that regex engines can't typically do as well with increasing window sizes like this when compared to a purpose built tool like ired. Nevertheless, ired is still slower than ripgrep in most of the tests I showed in my previous comment.

But optimally speaking, could something even faster than both ired and ripgrep be built? I believe so, yes. But only for some workloads, I suspect, with high match frequency. And ain't nobody going to build such a thing just to save a few milliseconds. Lol. The key is really to implement the windowing explicitly instead of relying on the regex engine to do it for you. Alternatively, one could add a special optimization pass to the regex engine that recognizes "windowing" patterns and does something clever. I have a ticket open for something similar[1].

[1]: https://github.com/rust-lang/regex/issues/802

burntsushi · 2024-01-03T22:38:53.000000Z

I tried your `_test2` and `rg` without `wc -l` takes 0.02s while with `wc -l` it takes 0.01s. The difference is meaningless. I don't believe you if you say that impacts your edit-compile-run cycle.

1vuio0pswjnm7 · 2024-01-05T04:55:38.000000Z

Below are the results I got from _test2.

    + echo .{150}(https:)|(http:).{10}
    + n=0
    + test 0 -le 3
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    real 1m 33.37s
    user 0m 1.25s
    sys  0m 2.97s
    + sleep 2
    + echo
    + echo Now try this with a pipe to wc -l...
    + echo
    + sleep 2
    + busybox time+  rg -o .{150}(https:)|(http:).{10} test.json
    wc -l
    real 0m 0.49s
    user 0m 0.45s
    sys  0m 0.02s
    + echo
    + sleep 5
    + echo
    + n=1
    + test 1 -le 3
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    real 1m 34.23s
    user 0m 1.75s
    sys  0m 4.22s
    + sleep 2
    + echo
    + echo Now try this with a pipe to wc -l...
    + echo
    + sleep 2
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    wc -l
    real 0m 0.40s
    user 0m 0.37s
    sys  0m 0.02s
    + echo
    + sleep 5
    + echo
    + n=2
    + test 2 -le 3
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    real 1m 33.59s
    user 0m 1.05s
    sys  0m 1.76s
    + sleep 2
    + echo
    + echo Now try this with a pipe to wc -l...
    + echo
    + sleep 2
    + busybox time rg -o .{150}(https:)|(http:).{10}+  test.json
    wc -l
    real 0m 0.45s
    user 0m 0.37s
    sys  0m 0.04s
    + echo
    + sleep 5
    + echo
    + n=3
    + test 3 -le 3
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    real 1m 33.99s
    user 0m 1.93s
    sys  0m 4.82s
    + sleep 2
    + echo
    + echo Now try this with a pipe to wc -l...
    + echo
    + sleep 2
    + busybox time rg -o .{150}(https:)|(http:).{10} test.json
    wc -l
    real 0m 0.40s
    user 0m 0.37s
    sys  0m 0.02s
    + echo
    + sleep 5
    + echo
    + n=4
    + test 4 -le 3
    + exit

No, I am not going to stare at the screen for a minute and half as thousands of matches are displayed. (In fact I am unlikely to even be examining a file of this size. It's more likely to be under 6M.) With a file this size what I would do is examine a sample of the matches, let's say for example the first 20.

Look at the speed of a shell script with 9 pipes, using ired 3x to examine the first 20 matches.

    busybox time sh -c "echo -n https:
    |od -tx1 -An \
    |tr -d '\40' \
    |sed 's>^>/>' \
    |ired -n test.json \
    |sed  '1,21s/.*/s&@s-150@b166@X/;21q' \
    |tr @ '\12' \
    |ired -n test.json \
    |sed 's/.*/w&0a/' \
    |ired -n /dev/stdout"

    real 0m 0.01s
    user 0m 0.00s
    sys 0m 0.00s

That speed is just right.

Now with the same ripgrep.

    busybox time rg -o .{150}https:.{10} test.json|sed 20q 

    real 0m 0.40s
    user 0m 0.37s
    sys 0m 0.02s

Being used to the speed of ired, this is slow.

Even more, ired is fraction of the size of ripgrep.

For me, the choice of what to use is easy, until I discover something better than ired.

burntsushi · 2024-01-05T17:06:41.000000Z

    $ busybox time rg -o '.{150}(https:)|(http:).{10}' test.json
    real    0m 0.02s
    user    0m 0.01s
    sys     0m 0.00s

I don't know how you're getting over a minute for that. Perhaps you aren't doing a good enough job at providing a reproduction.

> That speed is just right.

... unlike its results. "Wrong and fast, just the way I like it!" Lmao.

1vuio0pswjnm7 · 2024-01-03T03:33:10.000000Z

Not sure where "wc -l" came from. It was not in any of the tests I authored. That's because I am not interested in line counts. Nor am I interested in very large files either, or ASCII files of Shakespeare, etc. As I stated in the beginning, I am working with files that are like "a wall of text". Long lines, few if any linebreaks. Not the type of files that one can read or edit using less(1), ed(1) or vi(1).

What I am interested in is how fast the results display on the screen. For files of this type in the single digit MB range, piping the results to another program does not illustrate the speed of _displaying results to the screen_. In any event, that's not what I'm doing. I am not piping to another program. I am not looking for line counts. I am looking for patterns of characters and I need to see these on the screen. (If I wanted line counts I would just use grep -c -o.)

When working with these files interactively at the command line, performing many consecutive searches,^1 the differences in speed become readily observable. Without need for time(1). grep -o is ridiculously slow. Hence I am always looking for alternatives. Even a shell script with ired is faster than grep. Alas, ripgrep is not a solution either. It's not any faster than ired _at this task for files of this type and size_.

Part of the problem in comparing results is that we are ignoring the hardware. For example, I am using a small, low resource, underpowered computer; I imagine most software developers use more expensive computers that are much more powerful with vast amounts of resources.

Try some JSON as a sample but note this is not necessarily the best example of the "walls of text" I am working with; ones that do not necessarily conform to a standard.

   curl "https://api.crossref.org/works?query=unix&rows=1000" > test.json

   busybox time grep -Eo .{100}https:.{50}
   real 0m 7.93s
   user 0m 7.81s
   sys  0m 0.02s

This is still a tiny file in the world of software developers. Obviously, if this takes less than 1 second on some developer machine, then any comparison with me, an end user with an ordinary computer, are not going to make much sense.

1. Not an actual "loop", but an iterative, loop-like process of search file, edit program, compile program, run program, search file, edit program, ...

With this, speed becomes noticeable even if the search task is relatively short-lived.

burntsushi · 2024-01-03T03:48:45.000000Z

Well then can you share such a file? I wasn't measuring the time of wc. I just did that to confirm the outputs were the same. The fact is that I can't reproduce your timings, and ired is significantly slower in the tests I showed above.

I tried my best to stress the importance of match frequency, and even varied the tests on that point. Yet I am still in the dark as to the match frequency in your tests.

The timing differences even in your tests also seem insignificant to me, although they can be significant in a loop or something. Hence the reason I used a larger file. Otherwise the difference in wall time appears to be a matter of milliseconds. Why does that matter? Maybe I'm reading your timings wrong, but that would only deepen the mystery as to why our results are so different. Hence my request for an input that your care about so that we can get on the same page.

Not sure if it was clear or not, but I'm the author of ripgrep. The benefit to you from this exchange is that I should be able to explain why the perf difference has to exist (or is difficult to remedy) or file a TODO for making rg -o faster.

1vuio0pswjnm7 · 2024-01-04T00:06:38.000000Z

Another iterative loop-like procedure is search, adjust pattern and/or amount of context, search, adjust pattern and/or amount of context, search, ...

If a program is sluggish, I will notice.

The reason I am searching for a pattern is because there is something I consider meaningful that follows or precedes it. Repeating patterns would generally not be something I am interested in. For example, a repeating pattern such as "httphttphttp". The search I would do would more likely be "http". If for some reason it repeats, then I will see that in the context.

For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, also in hexdump or formatted hex, like xxd -p.

https://news.ycombinator.com/item?id=38564604

And it will do all this faster than grep -o and nearly as fast as a big, fat grep clone in Rust that spits out coloured text by default, even when ired is in a shell script with multiple pipes and other programs. TBH, neither grep nor grep clones are as flexible; they are IMO not suitable for me, for this type of task. But who knows there may be some other program I do not know about yet.

Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different. Not every user is using the same hardware. Nor is every user trying to do the exact same things with their computer.

For example, I have tried all the popular UNIX shells. I would not touch zsh with a ten-foot pole. Because I can feel the sluggishness compared to working in dash or NetBSD sh. I want something smaller and lighter. I intentionally use the same shell for interactive and non-interactive use. Because I like the speed. But this is not for everyone. Some folks might like some other shell, like zsh. Because [whatever reasons]. That does not mean zsh is for everyone, either. Personally, I would never try to proclaim that the reasons these folks use zsh are "insignificant". To those users, those reasons are significant. But the size and speed differences still exist, whether any particular user deems them "significant" or not.

burntsushi · 2024-01-04T00:24:06.000000Z

> If a program is sluggish, I will notice.

Well yes of course... But you haven't demonstrated ripgrep to be sluggish for your use case.

> For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, also in hexdump or formatted hex, like xxd -p.

Then what are you whinging about? grep isn't a command line hex editor like ired is. You're the one who came in here asking for grep -o to be faster. I never said grep (or ripgrep) could or even should replace ired in your workflow. You came in here talking about it and making claims about performance. At least for ripgrep, I think I've pretty thoroughly debunked your claims about perf. But in terms of functionality, I have no doubt whatsoever that ired is better fitted for the kinds of problems you talk about. Because of course it is. They are two completely different tools.

ired will also helpfully not report all substring results. I love how you just completely ignore the fact that your useful tool is utterly broken. I don't mean "broken" lightly. It has had hidden false negatives for 14 years. Lmao. YIKES.

> they are IMO not suitable for this type of task

Given that ripgrep gives the same output as your ired shell script (with a lot less faffing about) and it does it faster than ired, I find this claim baseless and without evidence. Of course, ripgrep will not be as flexible as ired for other hex editor use cases. Because it's not a hex editor. But for the specific case you brought up on your own accord because you wanted to come complain on an Internet message board, ripgrep is pretty clearly faster.

> nearly as fast as a big, fat grep clone in Rust

At least it doesn't have a 14 year old bug that can't find ABAB in ABAABAB. Lmao.

But I see the goalposts are shifting. First it was speed. Now that that has been thoroughly debunked, you're whinging about binary size. I never made any claims about that or said that ripgrep was small. I know it's fatter than grep (although your grep is probably dynamically linked). If people like you want to be stingy with your MBs to the point that you won't use ripgrep, then I'm absolutely cool with that. You can keep your broken software.

> Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different.

A trivial truism, and one that I've explicitly acknowledged throughout this discussion. I said that milliseconds in perf could matter for some use cases, but it isn't going to matter in a human paced iteration workflow. At least, I have seen no compelling argument to the contrary.

It's even subjective whether or not you care if your tool has given you false negatives because of a bug that has existed for 14 years. Different strokes for different folks, amiright?

burntsushi · 2024-01-03T04:00:51.000000Z

Okay I see you replied with more details elsewhere. I'll investigate tomorrow when I have hands on a keyboard. Thanks.

joshka · 2023-12-30T23:18:17.000000Z

There's a few ripgrep based tuis:

- https://github.com/acheronfail/repgrep

- https://github.com/konradsz/igrep

- https://github.com/seg-mx/grep_tui

- https://github.com/Robertleoj/grepedit

nsagent · 2023-12-31T03:19:29.000000Z

You can also use fzf with ripgrep to great effect:

[1]: https://github.com/junegunn/fzf/blob/master/ADVANCED.md#usin...

comex · 2023-12-30T23:09:26.000000Z

Interesting, it supports an n-gram indexer. ripgrep has had this planned for a few years now [1] but hasn't implemented it yet. For large codebases I've been using csearch, but it has a lot of limitations.

Unfortunately... I just tried the indexer and it's extremely slow on my machine. It took 86 seconds to index a Linux kernel tree, while csearch's cindex tool took 8 seconds.

[1] https://github.com/BurntSushi/ripgrep/issues/1497

dtgriscom · 2023-12-31T00:01:27.000000Z

That's close to a gig of disk reads; I trust you didn't try ugrep first and then cindex second, without taking into account caching.

comex · 2023-12-31T03:08:35.000000Z

I ran both multiple times, alternating (and making sure to clean out the indexes in between). Results were reasonably consistent across runs.

jgalt212 · 2023-12-31T13:37:37.000000Z

If you're gonna go the csearch route, you should also consider hound. I use it many times per day.

https://github.com/hound-search/hound

bishfish · 2023-12-31T17:27:55.000000Z

It creates per-directory index files on its first run. ugrep-indexer is also labeled as beta. A couple of relevant quotes from its GitHub site:

“Indexing adds a hidden index file ._UG#_Store to each directory indexed.”

“Re-indexing is incremental, so it will not take as much time as the initial indexing process.”

o11c · 2023-12-30T21:54:26.000000Z

Important note: not actually compatible. It took me seconds to find an option that does something completely different than the GNU version.

burntsushi · 2023-12-30T22:23:35.000000Z

Indeed. And here are some concrete examples around locale:

    $ grep -V | head -n1
    grep (GNU grep) 3.11
    $ alias ugrep-grep="ugrep-4.4.1 -G -U -Y -. --sort -Dread -dread"
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
    pokémon
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 ugrep-grep 'pok[[=e=]]mon'
    $ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
    γ
    $ echo 'γ' | LC_ALL=en_US.UTF-8 ugrep-grep -i 'Γ'

BSD grep works like GNU grep too:

    $ grep -V
    grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
    pokémon
    $ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
    γ

fwip · 2023-12-30T22:04:55.000000Z

Which option is that? I'm scanning the ugrep page, but nothing is popping out to me.

e12e · 2023-12-30T22:20:02.000000Z

I would assume compatible meant posix/bsd - unless explicitly advertised AS "GNU grep compatible"?

burntsushi · 2023-12-30T22:23:54.000000Z

From the OP: "Ugrep is compatible with GNU grep and supports GNU grep command-line options."

zaidhaan · 2023-12-31T06:10:04.000000Z

A little off-topic, but I'd love to see a tool similar to this that provides real-time previews for an entire shell pipeline which, most importantly, integrates into the shell. This allows for leveraging the completion system to complete command-line flags and using the line editor to navigate the pipeline.

In zsh, the closest thing I've gotten to this was to bind Ctrl-\ to the `accept-and-hold` zle widget, which executes what is in the current buffer while still retaining it and the cursor position. That gets me close (no more ^P^B^B^B^B for editing), but I'd much rather see the result of the pipeline in real-time rather than having to manually hit a key whenever I want to see the result.

wazzaps · 2023-12-31T07:03:32.000000Z

Sounds similar to this: https://github.com/akavel/up

tacone · 2023-12-31T13:08:37.000000Z

I guess Alt+a is the default zsh shortcut for that.

ijustlovemath · 2023-12-31T04:52:01.000000Z

Any particular reason why newer tools don't follow the well-established XDG standard for config files? Those folder structures probably already exist on end user machines, and keep your home directory from getting cluttered with tens of config files

xcdzvyn · 2023-12-31T04:58:27.000000Z

Slight rant/aside but Firefox is bad for this. You can point it to a custom profile path (e.g. .config/mozilla) but ~/.mozilla/profile.ini MUST exist. Only that one file - you can move everything else.

ijustlovemath · 2023-12-31T14:28:21.000000Z

In my mind, this is fine, as Firefox predates the standard by a long time. But newer tools specifically should know better.

tedunangst · 2023-12-31T06:30:46.000000Z

XDG isn't recognized as an authority outside of XDG.

burntsushi · 2023-12-31T04:56:56.000000Z

For ripgrep at least, you set an environment variable telling it where to look for a config file. You can put it anywhere, so you don't need to put it in $HOME.

I didn't do XDG because this route seemed simpler, and XDG isn't something that is used everywhere.

smaudet · 2023-12-31T17:57:18.000000Z

Standard should be - tool tells you where it's configured, how to change the config, and choose a 'standard' default config, such as XDG.

Assuming you aren't doing weird things with paths, I can work around 'dumb lazy' developers releasing half-assed tools with symlinks/junctions, but I really don't want to spend a ton of time configuring your tool or fighting its presumptions.

burntsushi · 2023-12-31T18:09:36.000000Z

Oh okay, I guess you've got it figured out. Now specify it in enough detail for others to implement it, get all stakeholders to agree and get everyone to implement it exactly to the spec.

Good luck. You're already off to a rough start with XDG, since that isn't what is used on Windows. And it's unclear whether it ought to be used on macOS.

Hendrikto · 2023-12-31T11:18:54.000000Z

> I didn't do XDG because this route seemed simpler

Simpler how? This requires custom config, instead of following what I set system-wide.

> and XDG isn't something that is used everywhere.

Yeah, that‘s why it defines defaults to fall back on.

burntsushi · 2023-12-31T12:45:40.000000Z

It's far simpler to implement.

No, you don't understand. I'm not saying The XDG variables might not be defined. Give me a little credit here lol. I have more than a passing familiarity with XDG. I've implemented it before. I'm saying the XDG convention itself may not apply. For example, Windows. And its controversial whether to use them on macOS when I last looked into it.

I don't see any significant problem with defining an environment variable. You likely already have dozens defined. I know I do.

I'm not trying to convince you of anything. Someone asked why. This is why for ripgrep at least.

ijustlovemath · 2023-12-31T15:49:18.000000Z

Could ripgrep not simply add a check for the XDG environment variables and use those, if no rg environment variable is given? Of course if both are not available you would use the default.

burntsushi · 2023-12-31T15:55:57.000000Z

Of course. But now you've complicated how config files are found and it doesn't seem like an improvement big enough to justify it.

Bottom line is that while ripgrep doesn't follow XDG, it also doesn't force you to litter your HOME directory. That's what most people care about in my experience.

I would encourage you to search the ripgrep issue tracker for XDG. This has all been discussed.

smaudet · 2023-12-31T18:07:31.000000Z

The issue is complexity - we could create some sort or 'standard tool' library that 'just works' on all platforms, but now building the tool and runtime bootstrapping the tool become more complex, and hence more likely to _break_.

Really most people want it in their path and it just to work in as many scenarios as possible. Config almost shouldn't be the responsibility of the tool at all... (Options passed to the tool via env variables, perhaps)...

Joel_Mckay · 2023-12-31T01:09:38.000000Z

Someone please just standardize the grep flags across all platforms.

Specifically -P / --perl-regexp support on MacOS and FreeBSD

It really would reduce the WTF moments for the students.

Insert jokes about standards below... =)

burntsushi · 2023-12-31T01:13:44.000000Z

That's what POSIX was supposed to be.

It's easier IMO to just use the same tool on all platforms. Which you can of course do.

Joel_Mckay · 2023-12-31T01:25:32.000000Z

Not sure if brew's grep is as NERF'ed, but POSIX standard often is just a subset of minimal features for the GNU version.

Cheers, =)

burntsushi · 2023-12-31T01:30:07.000000Z

Yes, that's the problem. You need to maintain a close attention level to know which things are POSIX. And in the case of GNU grep, you actually need to set POSIXLY_CORRECT=1. Otherwise its behavior is not a subset.

POSIX also forbids greps from searching UTF-16 because it mandates that certain characters always use a single byte. ripgrep, for example, doesn't have this constraint and thus can transparently search UTF-16 correctly via BOM sniffing.

karakanb · 2023-12-31T09:48:39.000000Z

Slightly off topic, but how does one publish so many installable versions of a binary across all the package managers? I figured out how to do it for Brew, but the rest seems like a billion different steps that need to be done and I feel like I am missing something.

wint3rmute · 2023-12-31T11:43:48.000000Z

You only have to set up CI/CD once for each package type, afterwards all the packaging work is done for you automatically.

Ripgrep is also quite a large project (judging on both star count and contribution cout), so people probably volunteer to support their platform/package manager of choice.

mathverse · 2023-12-30T21:47:34.000000Z

Also look at https://github.com/stealth/grab from Sebastian Krahmer.

meowface · 2023-12-31T02:15:11.000000Z

ripgrep, grab, ugrep, hypergrep... Any of the four are probably fast enough for any of my use cases but I suddenly feel tempted to micro-optimize and spend ages comparing them all.

infamia · 2023-12-30T21:35:43.000000Z

Ugrep is also available in Debian based repos, which is super nice.

louwrentius · 2023-12-31T12:44:15.000000Z

I will never learn this tool

I will not even contemplate using this tool.

The reason is very simple: I can trust 'grep' to be on any system I ever touch. Learning ugrep doesn't make any sense as I can't trust it to be available.

I could still use it on my own systems, but I work on customer systems which won't have this tool installed.

And I'm proficient enough with grep that it's 'good enough', I'm not focussing on a better grep. I'm focussing on fixing a problem, or trying something new.

I'd rather invest my time into something that will benefit me across all environments I work with.

Because a tool may be 'better' (whatever that means) doesn't mean it will see adoption.

This is not about being closeminded, but it's about focus on what's really important.

jftuga · 2023-12-31T01:06:29.000000Z

I really like the fuzzy match feature. Useful for typos or off by 1-2 characters.

https://github.com/Genivia/ugrep#fuzzy

jraph · 2023-12-30T21:49:24.000000Z

Okay, this solves a feature I was occasionally missing for a long time: searching for several terms in files (the "Googling files" feature). I wrote a 8 line script a few weeks ago to do this, that I will gladly throw away. I'll look into the TUI too.

(I've been using ripgrep for quite some time now, how does this otherwise compare to it? would I be able to just replace rg with ug?)

Levitating · 2023-12-30T23:16:46.000000Z

Is it that different from using fzf?

I currently use ripgrep-all (which can search into anything, video captions or pdfs) and fzf.

bishfish · 2023-12-31T02:06:12.000000Z

ugrep+ has this feature similar to ripgrep-all.

For regular use, I use ugrep’s %u option with its format feature to only get one match per line same as other grep tools.

Overall, I’m a happy user of ugrep. ugrep works as well as ripgrep for me. It’s VERY fast and has built-in option to search archives within archives recursively.

dmlerner · 2023-12-30T21:43:54.000000Z

Why not ripgrep?

GuB-42 · 2023-12-31T00:20:47.000000Z

Why not ugrep?

They are more or less equivalent. One has obscure feature X other has obscure feature Y, one is a bit faster on A, other is a bit faster on B, the defaults are a bit different, and one is written in Rust, the other in C++.

Pick the one you like, or both. I have both on my machine, and tend to use the one that does what I want with the least options. I also use GNU grep when I don't need the speed or features of either ug and rg.

tredre3 · 2023-12-30T22:34:50.000000Z

One thing I never liked about ripgrep is that it doesn't have a pager. Yes, it can be configured to use the system-wide ones, but it's an extra step (and every time I have to google how to preserve colors) and on Windows you're SOL unless you install gnu utils or something. The author always refused to fix that.

Ugrep not only has a pager built in, but it also allows searching the results which is super nice! And that feature works on all supported platforms!

bornfreddy · 2023-12-30T22:48:48.000000Z

Interesting - for me a built-in pager is an antifeature. I don't want to figure out how to leave the utility. Worst of all, pager usually means that sometimes you get more pages and you need to press q to exit, and sometimes not. Annoying. I often type yhe next command right away and the pager means I get stuck, or worse, pager starts doing something in response to my keys (looking at you, `git log`).

Then again I'm on Linux and can always pipe to less if I need to. I'm also not the target audience for ugrep because I've never noticed that grep would be slow. :shrug:

amethyst · 2023-12-31T00:58:26.000000Z

You might appreciate setting `PAGER=cat` in your environment. ;)

Git obeys that value, and I would hope that most other UNIXy terminal apps do too.

bornfreddy · 2023-12-31T08:10:33.000000Z

Oh, wow, thank you! I must try this.

VTimofeenko · 2023-12-31T00:41:01.000000Z

Some terminal emulators (kitty for sure) support "open last command output in pager". Works great with a pager that can understand ANSI colors - less fussing around with variables and flags to preserve colors in the pager

burntsushi · 2023-12-30T22:46:56.000000Z

This is what I do personally:

    $ cat ~/bin/rgp
    #!/bin/sh
    exec rg -p "$@" | less -RFX

Should work just fine. For Windows, you can install `bat` to use a pager if you don't otherwise have one. You don't need GNU utils to have a pager.

anjanb · 2023-12-31T02:19:44.000000Z

hi @burntsushi,

   fan of your tool. like it's speed and defaults.

I use windows : didn't understand what you mean by "install `bat`" to use a pager.

I use cygwin and WSL for my unix needs. I have more and less in cygwin for use in windows.

burntsushi · 2023-12-31T02:25:14.000000Z

I referenced bat because I've found that suggesting cygwin sometimes provokes a negative reaction. The GP also mentioned needing to install GNU tooling as if it were a negative.

bat is fancy pager written in Rust. It's on GitHub: https://github.com/sharkdp/bat

anjanb · 2023-12-31T04:17:39.000000Z

I'm sure you know but windows command prompt always came with its inbuilt pager -- more. So, you could always do "dir | more" or "rg -p "%*" | more ". (more is good with colors without flags)

burntsushi · 2023-12-31T04:54:05.000000Z

I didn't! I'm not a Windows user. Colors are half the battle, so that's good. Will it only appear if paging is actually needed? That's what the flags to `less` do in my wrapper script above. They are rather critical for this use case.

ilyagr · 2023-12-31T08:23:23.000000Z

I don't believe bat is a paper; it's more of a pretty-printer that tends to call less.

Two pallets that should work on Windows are https://github.com/walles/moar (golang) and https://github.com/markbt/streampager (Rust). There might also be a newer one that uses rust, I'm unsure.

ttyprintk · 2024-01-01T02:10:41.000000Z

I'd recommend ov for Windows users.

https://github.com/noborus/ov

bat on Windows does page, but I believe it's only available on Choco and not winget.

ilyagr · 2024-01-02T01:40:44.000000Z

Good find, thanks! I'll check if I prefer it to moar.

As for bat, according to https://github.com/sharkdp/bat#using-bat-on-windows, the Chocolatey package simply installs `less` alongside `bat`. Seems like a good idea, but I haven't tried it.

ttyprintk · 2024-01-05T02:53:05.000000Z

Ah, thanks for doing the footwork.

MrDrMcCoy · 2023-12-31T02:21:24.000000Z

For me, it's a lot easier to compile a static binary of a C++ app than a Rust one. Never got that to work. Also nice to have compatibility with all of grep's arguments.

datadeft · 2023-12-31T12:52:05.000000Z

> to compile a static binary

Cargo is one of the main reasons to use Rust of C++. I am pretty sure there is more involved with C++ than this:

   rustup target add x86_64-unknown-linux-musl 
   cargo build --target=x86_64-unknown-linux-musl

devraza · 2023-12-30T21:48:56.000000Z

From the ugrep README:

For an up-to-date performance comparison of the latest ugrep, please see the ugrep performance benchmarks [at https://github.com/Genivia/ugrep-benchmarks]. Ugrep is faster than GNU grep, Silver Searcher, ack, sift. Ugrep's speed beats ripgrep in most benchmarks.

codetrotter · 2023-12-30T21:57:54.000000Z

Does these performance comparison take into account the things BurntSushi (ripgrep author) pointed out in the ripgrep issue link elsewhere ITT? https://github.com/BurntSushi/ripgrep/discussions/2597

Either way, ripgrep is awesome and I’m staying with it.

devraza · 2023-12-31T00:02:48.000000Z

Agreed - ripgrep is great, and I'm not planning to switch either. The performance improvement is tiny, anyways.

Conscat · 2023-12-30T22:19:08.000000Z

The best practical reason to choose this is its interactive features, like regexp building.

philkrylov · 2024-01-01T18:13:14.000000Z

Although being faster in some cases, ripgrep lacks archive search support (no, transparent decompression ignoring the archive structure is not enough) which works great in ugrep.

0cf8612b2e1e · 2023-12-30T21:49:10.000000Z

I assume the grep compatible bit is attractive to some people. Not me, but they exist.

derriz · 2023-12-30T21:57:07.000000Z

I find myself returning to grep from my default of rg because I'm just too lazy to learn a new regex language. Stuff like word boundaries "\<word\>" or multiple patterns "$one\|two$".

masklinn · 2023-12-30T22:03:02.000000Z

That seems like the weirdest take ever: ripgrep uses pretty standard PCRE patterns, which are a lot more common than posix’s bre monstrosity.

To me the regex langage is very much a reason to not use grep.

derriz · 2023-12-30T22:29:59.000000Z

A bit hyperbolic, no?

If you consider it "the weirdest ever", I'm guessing that I'm probably older than you. I've certainly been using regex long before PCRE became common.

As a vim user I compose 10s if not 100s of regexes a day. It does not use PCRE. Nor does sed, a tool I've been using for decades. Do you also recommend not using these?

comex · 2023-12-30T23:23:01.000000Z

I use all of those tools but the inconsistency drives me crazy as it's hard to remember which syntax to use where. Here's how to match the end of a word:

ripgrep, Python, JavaScript, and practically every other non-C language: \b

vim: \>

BSD sed: [[:>:]]

GNU sed, GNU grep: \> or \b

BSD grep: \>, \b, or [[:>:]]

less: depends on the OS it's running on

burntsushi · 2023-12-30T23:31:19.000000Z

Did you know that not all of those use the same definition of what a "word" character is? Regex engines differ on the inclusion of things like \p{Join_Control}, \p{Mark} and \p{Connector_Puncuation}. Although in the case of \p{Connector_Punctuation}, regex engines will usually at least include underscore. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...

And then there's \p{Letter}. It can be spelled in a lot of ways: \pL, \p{L}, \p{Letter}, \p{gc=Letter}, \p{gc:Letter}, \p{LeTtEr}. All equivalent. Very few regex engines support all of them. Several support \p{L} but not \pL. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...

pbhjpbhj · 2023-12-30T22:10:34.000000Z

`pgrep`, or `grep -P`, uses PCRE though, AFAIUI.

burntsushi · 2023-12-30T22:26:41.000000Z

ripgrep's regex syntax is pretty similar to grep -E. So if you know grep -E, most of that will transfer over.

Also, \< and \> are in ripgrep 14. Although you usually just want to use the -w/--word-regexp flag.

xoranth · 2023-12-30T22:48:36.000000Z

> Also, \< and \> are in ripgrep 14

Isn't that inconsistent with the way Perl's regex syntax was designed? In Perl's syntax an escaped non-ASCII character is always a literal [^1], and that is guaranteed not to change.

That's nice for beginners because it saves you from having to memorize all the metacharacters. If you are in doubt you on whether something has a special meaning, you just escape it.

[^1]: https://perldoc.perl.org/perlrebackslash#The-backslash

burntsushi · 2023-12-30T23:05:08.000000Z

Yes, it's inconsistent with Perl. But there are many things in ripgrep's default regex engine that are inconsistent with Perl, including the fact that all patterns are guaranteed to finish a search in linear time with respect to the haystack. (So no look-around or back-references are supported.) It is a non-goal of ripgrep to be consistent with Perl. Thankfully, if you want that, then you can get pretty close by passing the -P/--pcre2 flag.

With that said, I do like Perl's philosophy here. And it was my philosophy too up until recently. I decided to make an exception for \< and \> given their prevalence.

It was also only relatively recently that I made it possible for superfluous escapes to exist. Prior to ripgrep 14, unrecognized escapes were forbidden:

    $ echo '@' | rg-13.0.0 '\@'
    regex parse error:
        \@
        ^^
    error: unrecognized escape sequence
    $ echo '@' | rg '\@'
    @

I had done it this way to make it possible to add new escape sequences in a semver compatible release. But in reality, if I were to ever add new escape sequences, it use one of the ascii alpha-numeric characters, as Perl does. So I decided it was okay to forever and always give up the ability to make, e.g., `\@` mean something other than just matching a literal `@`.

`\<` and `\>` are forever and always the lone exceptions to this. It is perhaps a trap for beginners, but there are many traps in regexes, and this seemed worth it.

Note that `\b{start}` and `\b{end}` also exist and are aliases for `\<` and `\>`. The more niche `\b{start-half}` and `\b{end-half}` also exist, and those are what are used to implement the -w/--word-regexp flag. (Their semantics match GNU grep's -w/--word-regexp.) For example, `\b-2\b` will not match in `foo -2 bar` since `-` is not a word character and `\b` demands `\w` on one side and `\W` on the other. However, `rg -w -e -2` will match `-2` in `foo -2 bar`:

    $ echo 'foo -2 bar' | rg -w -e '\b-2\b'
    $ echo 'foo -2 bar' | rg -w -e -2
    foo -2 bar

xoranth · 2023-12-30T23:31:36.000000Z

Ok, makes sense. And thanks for the detailed explaination about word boundaries and the hint about the --pcre flag (I hadn't realized it existed).

jedisct1 · 2023-12-31T11:27:18.000000Z

Fuzzy matching is the main reason I switched to ugrep. This is insanely useful.

meindnoch · 2023-12-30T22:00:18.000000Z

Because this is faster?

bsdpufferfish · 2023-12-30T22:07:43.000000Z

ripgrep stole the name but doesn’t follow the posix standard.

ww520 · 2023-12-30T22:20:11.000000Z

Just tried it out. It's blazingly fast. The interactive TUI search is pretty sweet.

tarun_anand · 2023-12-31T08:22:45.000000Z

Very insightful discussion. Is there a regex library that is tuned for in-memory data/strings? Similar to in-memory databases?

I recall using hyperscan but isn't it discontinued.

nineteen96 · 2023-12-31T08:12:46.000000Z

this is slick! easily the best of these new grep tools. thanks for sharing. i’ll use this when grep(1) doesn’t quite cut it

Sparkyte · 2023-12-31T00:50:03.000000Z

Cool, but in a real life scenario where the system is not able to pull from external packages because it is in a secured environment makes myself think this is moot as you'll be out of practice of actually running grep. I would avoid not staying out of practice with grep.

On the other hand for a non-work environment where security isn't in question this is cool.

seanp2k2 · 2023-12-31T00:53:08.000000Z

This. I had to beg and wait about a year to get jq added to our base image once it passed sec review and all that.

kyawzazaw · 2023-12-31T01:34:36.000000Z

I find bat pretty useful on my local machine

jedberg · 2023-12-31T01:05:53.000000Z

I feel like if you're going to make a new grep and put a web page for it, your webpage should start with why your grep is better than the default (or all the other ones).

Why did you build a new grep?

infamia · 2023-12-31T02:30:41.000000Z

> I feel like if you're going to make a new grep and put a web page for it, your webpage should start with why your grep is better than the default (or all the other ones).

No snark here, but is the subtitle not enough to start? "a more powerful, ultra fast, user-friendly, compatible grep"

jedberg · 2023-12-31T05:57:32.000000Z

Not really.

* a more powerful -- This is meaningless without some sort of examples. Powerful how? What does it do that's better than grep?

* ultra fast -- This at least means something, but it should be quantified in some way. "50%+ faster for most uses cases" or something like that.

* user-friendly -- not even sure what this means. Seems kind of subjective anyway. I find grep plenty user friendly, for a command line tool.

* compatible grep -- I mean, they all are pretty much, but I guess it's good to know this?

infamia · 2023-12-31T07:47:44.000000Z

> * ultra fast -- This at least means something, but it should be quantified in some way. "50%+ faster for most uses cases" or something like that.

That would be begging for nerd rage posts, just like so many disputing the benchmarks. >:D

> * user-friendly -- not even sure what this means. Seems kind of subjective anyway. I find grep plenty user friendly, for a command line tool.

Just below is a huge, captioned screenshot of the TUI?

> * compatible grep -- I mean, they all are pretty much, but I guess it's good to know this?

One would think so... but I have so many scars concerning incompatibilities with different versions of grep (as do others in the comments). If you don't know, then that feature isn't listed for you. :)

fsckboy · 2023-12-31T04:17:44.000000Z

no snark here, but the subtitle was the start of my confusion: what does "user-friendly" mean in the context of grep, and why should I believe the claim?

regular expressions are not friendly, but the user friendly way for a cli filter to behave is to return retvals appropriately, output to stdout, error messages to stderr... does user friendly mean copious output to stderr? what else could it possibly mean? do I want copious output to stderr?

infamia · 2023-12-31T08:10:03.000000Z

> no snark here, but the subtitle was the start of my confusion: what does "user-friendly" mean in the context of grep, and why should I believe the claim?

Granted, it is far from a thing of beauty, but there is a large, captioned screenshot of the included text user interface just beneath. Then again, it is a website for a command line tool. "Many Bothans died to bring us this information."

stevebmark · 2023-12-30T23:32:07.000000Z

There are many grep variations. The Unix philosophy: do one thing well. The Unix reality: do many things poorly*

*grep, awk, sed