It's probably worth pointing out that OS X is using BSD grep. You will see a phenomenal difference if you install GNU grep via Homebrew:
$ brew info grep
GNU grep, egrep and fgrep
https://www.gnu.org/software/grep/
/usr/local/Cellar/grep/3.3 (21 files, 885.3KB) *
Poured from bottle on 2019-01-03 at 12:23:37
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/grep.rb
...
This is a fairly arbitrary example, but I've got a 1.5 GB application log sitting on my machine right now that'll suffice. I'll even do this in a way that might give a slight performance advantage to BSD, by using GNU grep first (so the file is already in the cache for the second run):
$ time /usr/local/bin/ggrep "foobarbaz" application.log
real 0m1.319s
user 0m0.948s
sys 0m0.345s
vs OS X's grep:
$ time /usr/bin/grep "foobarbaz" application.log
real 0m37.225s
user 0m31.036s
sys 0m1.286s
1s vs 37s. Same file.
There's nothing odd about the file: straight, boring application logs, where each line starts with the logging level in capital letters. So I can use a regex that is pinned to the start of the line, which is about as optimal as you can get:
First, GNU grep:
$ time /usr/local/bin/ggrep -c "^INFO" application.log
1527786
real 0m0.622s
user 0m0.323s
sys 0m0.292s
Then OS X's native BSD grep:
$ time /usr/bin/grep -c "^INFO" application.log
1527786
real 0m3.588s
user 0m3.206s
sys 0m0.349s
BSD grep does significantly better on this search than on the prior one, but it is still notably slower than GNU grep.
In my experience, this isn't the most pathological example, but it's close. Note that this also applies to other tools that leverage grep under the surface, like zgrep etc.
I've specifically aliased grep over to ggrep in my shell so that I'm avoiding bsd grep whenever I can:
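Something along these lines in a shell profile does the trick (a minimal sketch, assuming Homebrew has installed GNU grep as ggrep):

    # ~/.bashrc or ~/.zshrc: prefer GNU grep whenever it's available
    if command -v ggrep >/dev/null 2>&1; then
        alias grep='ggrep'
        alias egrep='ggrep -E'
        alias fgrep='ggrep -F'
    fi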
All the OS X tools are horribly broken due to something messed up inside their Unicode support. Try something like tr:
time gtr a b < application.log > /dev/null
time tr a b < application.log > /dev/null
If you set LC_ALL=C it might be a little faster.
$ time gtr a b < http_10x.log > /dev/null
real 0m0.061s
user 0m0.032s
sys 0m0.028s
$ time tr a b < http_10x.log > /dev/null
real 0m2.284s
user 0m2.267s
sys 0m0.014s
The same LC_ALL=C trick also speeds up GNU tools, sort in particular:
$ shuf /var/log/syslog > shuf-syslog
$ time sort shuf-syslog > /dev/null
real 0m0.536s
user 0m0.870s
sys 0m0.031s
$ time LC_ALL=C sort shuf-syslog > /dev/null
real 0m0.094s
user 0m0.109s
sys 0m0.024s
It's more dramatic with a bigger file, of course.
I once got annoyed with sort's speed, threw together a parallel external sort program that dramatically outperformed it, then realized the difference was not as dramatic with LC_ALL=C... oh well, it was a fun afternoon project anyway.
It's certainly worth considering, but the answer depends on what you're doing with it. If your goal is to binary search over it for an exact string match, for example, the important thing is that your sorting and searching code are consistent. Might as well have both use the simpler/faster LC_ALL=C semantics.
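A small illustration of "keep them consistent," using join as the consumer (just a sketch; the file names are made up): any tool that expects sorted input has to agree on collation with the sort that produced it.

    # pin both the producer and the consumer to the same collation
    LC_ALL=C sort ids-a.txt > a.sorted
    LC_ALL=C sort ids-b.txt > b.sorted
    LC_ALL=C join a.sorted b.sorted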
Locale-specific sorting is not the same as sorting by codepoint. The Unicode collation algorithm is fairly involved: https://www.unicode.org/reports/tr10/
If you are just using sort because you need to use uniq then you can replace the whole thing with an awk command that gets unique lines without sorting. I think it's faster.
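The usual awk idiom for that is a one-liner; it keeps the first occurrence of each line and preserves input order, at the cost of holding every distinct line in memory:

    # unique lines without sorting
    awk '!seen[$0]++' application.log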
Just don't forget about these aliases, since GNU grep and the built-in grep can vary in what they support. I once had a head-scratching afternoon where `sed` would accept GNU extensions from the command line but not inside a shell script, and it took far too long to realize that the alias only applied to my interactive shell!
Keep in mind the "BSD grep" in OS X is some ancient version. Some work was done in the last few years to improve it somewhat. Today, I'd skip over both tools in favor of smarter searchers like Ripgrep or Ag (the_silver_searcher).
Why wouldn't pinning to start of line help for some patterns/data? You do still have to find the next newline, but you don't have to make other, additional comparisons once ^x is false, for that line.
Or, roughly, grepping for '^x' instead of 'x' should result in less work if the line does not contain x in the first character. One fewer comparison for each character after the first.
You seem to be thinking about handling a line at a time. In that context, checking for an x at the beginning would be much faster than looking for an x anywhere in the line. However, for anything fast you're going to be handling a large buffer at a time, if not the entire memory-mapped file. In that context, there's no difference between ^x and yx, except that ^x also needs to match at start-of-file (so if anything it should be marginally slower). Finding just x is probably faster than finding yx - longer patterns can be faster when you can skip looking at some fraction of the input (as per TFA) but I expect 2 characters isn't enough to allow that.
Just to elaborate a little on my quick callout. Doing line oriented searches actually requires you to look at every character, if only to find all the line breaks.
Instead, treat the file as a collection of characters and just look for patterns. Which may include the line break character.
That make sense?
Edit: I should also add that you should look for burntsushi's posts. Turns out some of this had become out of date. And largely depends on size of what you are searching.
It's a pity that Apple lets license politics outweigh technological excellence in the software it chooses to ship.
Way back when, the first thing I'd do on any non-GNU host was to install a full GNU userland. I could write a book on my issues with GNU tools, but they are on balance preferable to the alternatives. All IMHO, of course.
In the latest version, if you need to use these commands with their normal names, you can add a "gnubin" directory to your PATH from your bashrc like:
PATH="/usr/local/opt/grep/libexec/gnubin:$PATH"
According to Apple's MacBook Pro marketing page [0], their SSD supports sequential read speeds of up to 3.2GB/s. So this isn't as far-fetched as you imply, even if we're discounting other factors like the filesystem cache.
Yes, I see these specs all the time but I never see them materialize in the real world. Agree about the cache but that's also an easy way to fool yourself when you are measuring performance.
> easy way to fool yourself when you are measuring performance
In this case isn't the "already in RAM" test a more accurate reflection of performance anyway, as we are talking about the performance of grep and not the IO subsystem?
There are many cases where grep's performance won't be bottlenecked by IO, or at least not impacted directly by a bottleneck there. Anywhere when the input is coming from another program, essentially, and even if that task is IO bound it might be sending output to grep much faster than it is consuming input (perhaps gunzip <file | grep searchtext).
And in the case of searching a log file interactively, it is common that you won't run grep over the file just once in a short space of time; instead you'll run it a few times as you refine your search, so for most of the runs the file will be in cache (assuming it is not too large).
This usage is literally ideal for pretty much any file I/O - it's a straight sequential read. Even spinning rust will achieve >400MB/s on this type of load. Take a look at the sequential burst speeds at QD1, first chart: https://www.anandtech.com/show/13761/the-samsung-970-evo-plu...
Nearly every SSD listed achieves well over 1 GB/s in an actual benchmark, not just on a spec sheet. And these are just boring, off-the-shelf consumer drives. Nothing crazy.
HDDs almost never sustain 400 MB/s unless you are talking about something pretty exotic. 5400 RPM drives are generally in the 100-130 MB/s range, and 7200 RPM drives proportionally faster but still usually under 200 MB/s.
On my puny laptop with an SSD I get ~400MB/s from cold cache, and ~5GB/s after the first run. So the answer is likely "it's in the FS cache".
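If you want to reproduce the cold-versus-warm comparison on macOS, one way (assuming you're willing to use sudo) is to flush the filesystem cache with purge between runs:

    sudo purge                             # drop the filesystem cache
    time grep foobarbaz application.log    # cold-cache run
    time grep foobarbaz application.log    # warm-cache run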
That's a very common use-case with grep. Either grepping a file you recently wrote, or running grep multiple times as you refine the regex, at which point the files will be in the FS cache.
This would not help, since the backing storage doesn't provide support for this kind of resolution. It would end up reading in the entire file anyways, unless your input string is on the order of an actual block.
Sure it would help, not with the IO part, but with the CPU-bound part of actually checking each character, which is apparently the much tighter limit in this case.
Yeah, that's why the article talks about decreasing the amount of CPU work. From the context of disk IO though (which is what this thread seems to be about) this can't help.
Loading the file from disk into memory doesn’t require reads by the CPU. That’s significantly different than the cpu doing comparisons (or even reads) on each byte.
This is why the linked article specifically says it does not look for newlines:
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
It can whenever you don't ask for line numbers, can't it?
> It probably looks for newlines after a match is found
Probably, yeah. Counting number of newlines in a range, without caring just where they fall, can probably be pretty darned fast with vector instructions. Idk if that's worth the additional pass, though.
It doesn't need to look for newlines. That is looking for ^ABCD is not harder than looking for ABCD. The set of matches for the former is a subset of the latter, so if you have some fast way of doing the latter you have a fast way of doing the former (with an additional check for the preceding newline).
Another way of looking at it is to just consider the ^ another character (plus one-off special handling for start of file).
Others have mentioned the SSD and VFS cache, however spinning disks in a RAID configuration can easily surpass this in raw sequential read performance.
GNU grep is an excellent tool! A similar tool that is rising in popularity is ripgrep (https://github.com/BurntSushi/ripgrep), which can be orders of magnitude faster than GNU grep. Give it a try!
In my personal tests, it's not at all orders of magnitude faster than GNU grep. In fact, for single file, non-unicode greps (most of my usage) the difference is so small as to be imperceptible when interactively using it.
Even for multi-file scenarios, the difference is nowhere near close the difference between GNU grep and BSD grep. This means that compatibility with GNU grep takes priority for me and it's not worth switching over to ripgrep.
It's a simple breakdown of communication. I try to be upfront about this, but no matter how hard I try or how many times I try to clarify it, I can't stop the spread of inaccurate claims. (And my goodness, this entire HN thread has numerous inaccuracies just with GNU grep alone.)
On equivalent tasks, ripgrep is not orders of magnitude faster than GNU grep, outside of pathological cases that involve Unicode support. (I can provide evidence for that if you like.)
For example, in my checkout of the Linux kernel, here's a recursive grep that searches everything:
$ time LC_ALL=C grep -ar PM_RESUME | wc -l
17
real 1.176
user 0.758
sys 0.407
maxmem 7 MB
faults 0
Now compare that with ripgrep, with a command that uses the same amount of parallelism and searches the same amount of data:
$ time rg -j1 -uuu PM_RESUME | wc -l
17
real 0.581
user 0.187
sys 0.384
maxmem 7 MB
faults 0
Which is 2x faster, but not "order of magnitude." Now compare it with how long ripgrep takes using the default command:
$ time rg PM_RESUME | wc -l
17
real 0.125
user 0.646
sys 0.654
maxmem 19 MB
faults 0
At 10x faster, this is where you start to get to "order of magnitude" faster claims. But for someone who cares about precise claims with respect to performance, this is uninteresting because ripgrep is 1) using parallelism and 2) skipping some files due to `.gitignore` and other such rules.
You can imagine that if your directory has a lot of large binary files, or if you're searching in a directory with high latency (a network mount), then you might see even bigger differences from ripgrep without generally seeing a difference in search results, because ripgrep tends to skip things you don't care about anyway.
In summary, there is an impedance mismatch when talking about performance, because most people don't have a good working mental model of how these tools work internally. Many people report on their own perceived performance improvements and compare that directly to how they used to use grep. They aren't wrong in a certain light, because ultimately, the user experience is what matters. But of course, they are wrong in another light if you're interpreting it as a precise technical claim about the performance characteristics of a program.
No, that's a bad trick to rely on for speed these days. GNU grep's speed comes from using memchr in its skip loop. Skipping bytes is only going to matter when the thing you're searching for is longer. For short needles, there isn't much opportunity for skipping, so it's better to stay inside a vectorized routine like memchr (which, for glibc, is written in Assembly on most platforms).
^ this. For common usage of searching for decently long patterns (>3-6 bytes) with several literal characters/limited wildcarding on a text with a largish alphabet, it is algorithmically very difficult to do _far_ better than GNU Grep’s Boyer-Moore-based implementation.
That situation changes when you have very short patterns, very long patterns, many patterns, or small alphabets (e.g. DNA). As burntsushi notes, ripgrep's performance difference for common real-world usage comes more from being smarter about its input than from algorithms (its algorithms are solid, of course).
So if you cripple ripgrep to act like grep, it is not much faster. This is kind of obvious. Ripgrep is faster because it makes assumptions that hold true for a large part of its users.
This is like saying "if I remove wings from a plane, then it won't go much faster than my car. So a plane is not technically faster than a car"
Of course, users should know the difference between grep and ripgrep (especially with tweaks like .gitignore, which could be confusing if you don't know).
Yup, that's a known bug. It will be fixed in the next release. I've already done the legwork to fix it, and it should be on master soon. My local copy of ripgrep can now search with my local /usr/share/dict/words as exact patterns just as fast as grep.
This "huge diffrence" is just miliseconds for a 20M LoC, as the author showed in this thread using the linux kernel source, he gives importance to the UX when compared to grep though.
Read the second half of his comment and ask what percentage of the Linux kernel source would be excluded by file type rules versus other projects. I mostly work on web and data archival projects and ripgrep is noticeably seconds (or more!) faster because it can skip images, archives, etc.
If you’re talking about the core algorithm, it’s not that much faster but the clever adjustments for common patterns and modern CPU architectures also counts for real world results. It’s just important to be precise so people know there hasn’t been some fundamental breakthrough in searching.
> ripgrep is built on top of Rust's regex engine. Rust's regex engine uses finite automata, SIMD and aggressive literal optimizations to make searching very fast. Rust's regex library also maintains performance with full Unicode support by building UTF-8 decoding directly into its deterministic finite automaton engine.
The author said in a sister thread[1] that for similar tasks the performance is comparable. ripgrep (and ag) both feel incredibly fast because they skip whole files and directories which are usually unwanted when searching.
Early on in my career, like most novice programmers, I thought that custom-written C programs could be much faster than Unix tools if written well and for a specific purpose. However, I could not beat the speed of Unix tools like grep, cut or cat even once. That is when I realized just how well written these tools are and just how much optimization work has gone into them.
It's amazing to me how programs like these have seemed to avoid the universal phenomenon of technical debt. They've crystallized into an ideal version of themselves, and haven't continued to decay past that point. Maybe it's because of the Unix philosophy of single-purpose programs; no feature-creep tends to mean no technical debt.
By keeping a narrow focus and avoiding feature creep, the technical debt has been pretty minimal, and they've been able to pay it down easily. There's at least something to learn from the approach.
It does also help that there is a certain type of developer that sees such things as a personal challenge and will go through hell and high water to come up with more and more efficient ways to do things.
There is truth in what you say, however, if you think GNU tools are free of technical debt or feature creep, look into how ./configure works, as an example.
Autoconf code is actually quite clean--you just need to know M4 and have an appreciation of the problems autoconf was built to overcome. What changed over time is that most proprietary unix systems disappeared[1], and POSIX compliance and general consistency have improved dramatically, BSDs and GNU included.
Also, best practices have shifted to a continuous development model which keeps everybody on the upgrade treadmill. There's less concern with maintaining backwards compatibility and catering to those not running the latest environments. So if you make use of some newer Linux kernel API there's only a short window where people will put in the effort to maintain a compatibility mode, assuming they bother at all.
Lastly, containers mean people often develop and ship static environments that can be maintained independently, sometimes never upgraded at all.
What I find interesting is how people have begun to ditch autoconf in favor of even more complex (but newer and therefore cooler) build systems when ironically there's less need than ever for these things. Autoconf doesn't need replacing; such layers can often be left out entirely.
That said, when feature detection and backwards compatibility truly matters there's no good alternative. CMake, for example, effectively requires the end user to install the latest version of CMake, and if you already expect someone to install the latest version of something then why the contrivance at all? I always sigh aloud whenever I download a project that relies on CMake because I know that I now have two problems on my hand, not just one. (But better CMake than the other alternatives--I just won't even bother.)
[1] All that's left are AIX and Solaris. HP-UX/PA-RISC will be officially dead next year, and EOL for HP-UX/Itanium is 2025. From a commercial perspective Solaris seems to be deader than AIX, however Solaris still seems to see more development--especially improved POSIX and Linux compatibility. It's much easier to port to Solaris than AIX. It's a real shame Solaris is disappearing because on big machines with heavy workloads the OOM killer is a fscking nightmare on Linux. Solaris and Windows (and maybe AIX?) are the only operating systems that do proper and thorough memory accounting, permitting you to write reliable software.[2] The cloud services principle that says individual processes are expendable doesn't work when your job takes hours or days to run. (Or even just minutes, because workloads accumulate when the OOM killer starts shooting things down, and even in the cloud you run into hard limits on resource usage--i.e. cap on numbers of nodes. Memory overcommit is just like network buffer bloat--intended to improve things at the small scale but which results in catastrophic degradation at the macro level.)
[2] You can disable overcommit on Linux but the model of overcommit is baked too deeply into the kernel's design. A machine can still end up with the OOM killer shooting down processes if, for example, the rate of dirty page generation outpaces the patience of the allocator trying to reclaim and access memory that is technically otherwise available.
Well, there's a lot of gross things about CMake (the syntax of its DSL is horrifying), but in my experience if you want Windows support it's a lot better than autoconf and make.
Good post. I agree that the main problem that autoconf solves has gone away.
CMake is regularly a pain point: its own versioning, as well as how hard the CMakeLists.txt files are to understand and debug.
Sharing code on Unix systems used to be painful; now the whole world is Linux. Until about 10 years ago the whole world was also 'gcc', but nowadays you are more likely to hit a clang/gcc difference than to need autoconf to pick the right header to include.
> There is truth in what you say, however, if you think GNU tools are free of technical debt or feature creep, look into how ./configure works, as an example.
./configure is a sad thing, and the cluster of madness around it is even worse (autoconf, automake, and the ridiculous libtool). However, most gnu stuff is excellent, including gnu make. Fortunately, today most unix systems are sufficiently posix-compliant so that you can ship a gnumakefile that builds your program directly without much fuss.
That's romanticizing it a bit... there's been a ton of feature creep[1] and a fair amount of technical debt. It's just more manageable because they've contained the scope of most of the core utilities and packages better than most 'modern' software. Also, don't underestimate the value of having a relatively stable set of maintainers over long periods of time.
[1] If you think about it, things like colorization belong in grep about as much as SVG generation belongs in systemd.
By optionally emitting generic structural markup which you could then use to format via colorization (or whatever other means you wanted) in a terminal, a GUI, a web page, etc. That would have been more consistent with the 'one tool, one job' philosophy of Unix. So then you could do something along the lines of 'grep --emit-markup ... | colorize-markup' (completely made up switch and command names) and use aliases/environment variables/etc. so that to a user could type something like 'cgrep' to get colorized output.
But the structure generator and the decoder would be so closely coupled that they might as well be the same program.
Alternatively, you could have grep take a list of colors from a config file or environment variable and blindly interpolate those strings to make colored output[1]. This also allows color to show up automatically for interactive use.
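That is roughly what GNU grep already does: the GREP_COLORS environment variable carries the SGR fragments it splices around matches, file names, line numbers, and so on. For example:

    # bold green matches, cyan file names, color only when writing to a tty
    GREP_COLORS='ms=01;32:fn=36' grep --color=auto ERROR application.log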
I guess what I meant was, most software projects eventually reach a point where they have to be burned down and rebuilt (or simply deprecated in favor of a younger project). Depending on the quality of the engineering it could be 5 years or 25, but it feels like an inevitability. These single-purpose GNU tools seem to be free of that phenomenon.
> most software projects eventually reach a point where they have to be burned down and rebuilt (or simply deprecated in favor of a younger project)
Is it really that, or do new people just come along and rebuild the same thing again and again without firm understanding of the older thing?
One approach would be to call GNU tools very exceptional and marvel at them (keep in mind its origins as a clone of older Unix stuff), but perhaps it's more appropriate to ask tougher questions of people operating in this other modality, that you are calling normal expectations.
If it does many things, it becomes a subordinate with an intelligence of its own, something you need to communicate with or talk to, as opposed to something you can simply use.
It's possible to beat the speed of cat, actually. See e.g. https://hub.darcs.net/vmchale/fastcat. One trick is to not free the memory used by the buffer :p
For many situations the more complex substring search algorithms gave way to raw brute force some time ago, I believe. For example, if you are looking for a given relatively short string, you can just take a prefix of, say, four bytes and then make parallel comparisons within a vector; this simple technique already gets you down to the general area of 1 cycle per byte. The SSE 4.2 PCMPSTR family of instructions is basically the same thing but microcoded, and is a bit faster.
For short patterns, which are imho by far the most common use, any algorithm that tries to be smart and skip a couple bytes wastes cycles on being smart where a simpler brute force algorithm has already fetched the next 16 bytes and started to compare them while the prefetcher already went off to get the next line from L2.
When did you measure SSE4.2 against straightforward vector usage and find it faster? In my experience this was true more or less never. There's almost nothing in SSE4.2 that can't be done better with SSSE3. Intel has slowly deprecated SSE4.2 by not promoting it to wider vectors (it is still 128b when most other instructions are 512b) and letting the throughput/latency stagnate.
1 cycle per byte is not a good result for single string search; you can practically do a shift-and comparison of your input in that speed using general purpose registers.
That work was all in the context of searching for a binary string with a mask. I didn't try too long to optimize it, since performance was quickly satisfactory, so it's not just possible, but rather very likely that my implementation isn't particularly good. I'll keep your advice about SSE 4.2 in mind should I revisit that code.
I wouldn't use "Teddy" to look for single strings, at least not without heavy modification. The boring approach of hunting for a predicate or two with PCMPEQB or equivalent then shift-and'ing things together has worked well in practice for that sort of thing, although it can be a bit brittle if you get the predicate(s) wrong.
It does not. ripgrep does not use Boyer-Moore in most searches.
In particular, the advice in the OP is generally out of date. The "secret" sauce to ripgrep's speed in simple literal searches is a simple heuristic: choose the rarest byte in the needle and feed that to memchr. (The "heuristic" is that you can't actually know the optimal choice, but it turns out that a guess works pretty well most of the time, since most things you search have a similar frequency distribution.)
The SSSE3 optimizations come from Hyperscan, and are only applicable when searching a small number of small patterns. e.g., `Holmes|Watson|Moriarty|Adler`.
In other words, for common searches (which are short strings), it is much better to spend more time in a vectorized routine than to try to skip bytes.
> In other words, for common searches (which are short strings), it is much better to spend more time in a vectorized routine than to try to skip bytes.
Complexity analysis of substring search (at least the analyses I've seen) focuses on the number of comparisons, much like sorting, and of course that's not an accurate model at all.
This is wrong. Boyer-Moore does the same thing. It just always selects the last byte to feed into memchr. If that byte happens to be rare, then it works great. But if it happens to be common, then it gets into the weeds pretty quickly. My suggested heuristic does the same thing, but increases the chances that it selects a rare byte.
And yes, the performance difference can be very large. Here's an example on a ~9.3GB file:
$ time rg --no-mmap 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
6698
real 3.006
user 1.658
sys 1.345
maxmem 8 MB
faults 0
$ time grep 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
6698
real 9.023
user 7.921
sys 1.092
maxmem 8 MB
faults 0
Notice that the pattern is `Sherlock `. The last byte is an ASCII space character, which is incredibly common. Boyer-Moore blindly picks this as the skip byte, but ripgrep uses a simple pre-computed frequency table to select a better byte as the skip byte.
> Presumably this is not too bad for ripgrep itself, as long as it falls back to something sensible when the assumption fails.
It does. That's why I said, "ripgrep does not use Boyer-Moore in most searches."
What pathological case are you referring to? The worst case I can think of is a completely random haystack in which case it is approximately equivalent to a naive search.
Even Boyer-Moore is not superior to a naive search in literally every case, e.g. short needle or large alphabet.
Interesting. Could this perform even faster if you (for large files) took a random sample of bytes before doing the search, to get the best distribution? Or just update the distribution as the search progresses.
It's probably more trouble than it's worth, and you'd need to figure out how to take the sample in a way that doesn't kill performance. It would be an interesting experiment, but the usual heuristic is just to give up on fast skipping if it's ineffective.
I typically search for the largest string I can nowadays. Though I suspect even those are not large enough to move the needle, since I'm usually just talking about a UUID.
I've been using ack (https://beyondgrep.com/) for a while and it seems to be better suited for my specific purpose -- looking into source files and avoiding things like data dumps and VCS directories. It may be slower than grep, but it's faster in the sense that I don't spend time remembering all the parameter combinations when I search for something.
Here are a few aliases I use:
alias ack="ack --color" # color output
alias ackl="ack -l" # show file names only
acksubl () {
ack -l -i "${1}" | xargs subl
}
# Do case-insensitive search and open files in sublime.
Turns out in practice I don't use regular expressions very often when searching text, and the most frequent question is -- where might this function/variable have been used?
Seems the articles are free, if you don't want to pay the €6085 yearly institutional subscription.
Other than Adrian's https://blog.acolyer.org/ what else is there worth following that is closer to engineering than theory? I wish we still had DDJ or C/C++ Users Journal, but today's periodicals are things like Raspberry Pi enthusiast magazines or Monthly Minecraft Tips.
You might already know this, but the current release of ripgrep has support for PCRE2 (as opposed to PCRE1 in ag). All you need to do is pass the -P flag. If you want it enabled by default, then:
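One way to make it the default is ripgrep's config-file support: point RIPGREP_CONFIG_PATH at a file containing the flags you always want (a sketch; the file name is arbitrary):

    # in your shell profile
    export RIPGREP_CONFIG_PATH="$HOME/.ripgreprc"

    # in ~/.ripgreprc: one flag per line
    --pcre2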
Yes, though it should be noted that the reason it's faster is fewer command invocations and ignoring usually unwanted files (like .git/ and files listed in .gitignore).
grep has a -R option which allows you to avoid the first problem, but avoiding the second one is a bit more difficult.
Also there is ripgrep which I've heard is effectively a faster version of ag written in Rust.
Can you file a bug report? Crashing is a serious bug, and I don't think one has been reported against ripgrep during a search for quite some time, so I'd be very keen on hearing about it.
Thanks! Yeah that was the last crashing related bug, due to a problem doing streaming transcoding on one of ruby's utf16 encoded files. It's fixed on master.
The second one isn't that hard for common cases - I've got `grepr` for `grep -r --exclude-dir .git --exclude-dir .mypy_cache --exclude-dir __pycache__`. Adjust as you like for the particular sorts of things you commonly need to ignore.
Except you also need to do `--exclude *.o` and whatever else is present in the .gitignore of any subdirectory. I personally use "git grep" when searching in source trees, as it automatically gives you gitignore support.
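For example, plain git grep only searches tracked files (so ignored build artifacts never show up), and pathspecs narrow it further:

    git grep -n PM_RESUME                  # tracked files only, with line numbers
    git grep -n PM_RESUME -- '*.c' '*.h'   # restrict to C sources via pathspecs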
I think the author phrased it somewhat differently, but my understanding is grep's high throughput also comes from its use of what we (Computer Engineering grad research group) referred to as a variable state machine. A colleague was researching implementing something analogous on an FPGA, for gigabit line-speed throughput. The preferred Computer Science term is apparently Thompson's NFA. https://en.wikipedia.org/wiki/Thompson%27s_construction#The_...
Thompson's NFA construction, when used directly to search via an NFA simulation, is dog slow. It does give you a O(nm) search, but in practice it's slow. AIUI, GNU grep uses a lazy DFA, which uses Thompson's NFA construction to build a DFA at search time. This does indeed lead to pretty good performance for the regex engine. But GNU grep's speed largely comes from optimizing the common case: extracting literals from your pattern, identifying candidate matching lines, and then checking with the full regex engine to confirm (or reject) the match.
I suspect Thompson's NFA is not inherently dog slow (Glushkov can be done reasonably fast for decent-sized NFAs). The fact is that most Thompson-lineage engines opted for the 'lazy DFA' approach and optimized that (which is effective until it isn't). I imagine a more aggressive 'native' Thompson NFA is possible. A nice benefit of that is not having to write to your bytecode - there's a good deal of systems-level complexity stuff in RE2 that springs out of a consequence of the 'lazy DFA construction' decision.
That being said, matching literals is always going to be faster, especially if you decompose the pattern to get more use out of your literal matcher - the downside of filtration is that if the literal is always present, you are just doing strictly more work. At least with decomposition you've taken the literal out of the picture. See https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-mul... for those who don't know what I'm talking about (I know you've read it).
Am flirting with doing another regex engine that gets some of the benefit of decomposition and literal matching without taking on the nosebleed complexity of Hyperscan...
Do you know of any fast Thompson NFA simulation implementation? I don't think I've seen one outside of a JIT.
Is there a fast glushkov implementation that isn't bit parallel? I've never been able to figure out how to use bit parallel approaches with large Unicode classes. Just using a single Unicode aware \w puts it into the weeds pretty quickly. That's where the lazy DFA shines, because it doesn't need to build the full DFA for \w (which is quite large, even after the standard DFA compression tricks).
Unicode is a PITA. In Hyperscan, it's not pretty what gets generated for a bare \w in UCP mode if you force it into an NFA (it's rather more tractable as a DFA, even if you aren't lazily generating, although of course betting the farm that you can always 'busily' generate a DFA isn't great).
I've always thought that a better job of doing NFAs (Gluskov or otherwise) and staying bit-parallel would be done with having character reachability on codepoints, not bytes, generally remapping down to 'which codepoints make an actual difference'. This sounds ugly/terrifying, but the nice thing is that remapping a long stream of codepoints could be done in parallel (as it's not hard to find boundaries) and with SIMD. Step by step NFA or DFA work is more ponderous as every state depends on previous states.
Yeah, I've looked at glushkov based primarily on your comments about it, but Unicode is always where I get stuck. In my regex engine, Unicode is enabled by default and \w is fairly common, so it needs to be handled well.
And of course, one doesn't need to bet the farm on a lazy DFA if you have one, although it is quite robust in a large number of practical scenarios. (I think RE2 does bet the farm, to be fair.)
Unicode + UCP is a perfectly principled thing, but it wasn't a design point that made any sense for Hyperscan as a default. The bulk of our customers were not interested in turning 1 state for ASCII \w into 600 states for UCP \w unless it was free.
I think both Glushkov and Thompson can be done fast, but I agree that they are both going to be Really Big for UCP stuff. Idle discussions among the ex-Hyperscan folks generally leans towards 'NFA over codepoints' being the right way of doing things.
Occam's razor suggests if you do only 1 thing in a regex system (i.e. designing for simplicity/elegance, which would be an interesting change after Hyperscan) it must be NFA, as not all patterns determinize. If you are OK with a lazy DFA system that can be made to create a new state per byte of input (in the worst case) then I guess you can do that too.
I am not sure how to solve the problem of "NFA over codepoints", btw. Having no more than 256 distinct characters was easy, but even with remapping, the prospect of having to handle arbitrary Unicode is... unnerving.
Yeah, my Thompson NFA uses codepoints for those reasons. But not in particularly smart way; mostly just to reduce space usage. It is indeed an annoying problem to deal with!
... and no, I don't know of any fast Thompson NFA simulations, but I don't see why they shouldn't be possible. They have a very simple "next" function, modulo the awfulness of getting past epsilons, but that seems to be roughly parallel to the awfulness of computing arbitrary 'next' functions in Glushkov-land. I'm not aware of anyone that's actually tried.
Presumably on macOS? Basically all BSD- or GNU-derived tools that ship on macOS are ancient, ancient versions that deserve to be burnt away and should not be considered indicative of what the current state-of-the-art is capable of.
The other commenter in this thread pointed out that this is a very old version and the newer bsd version is better.
real 0m2.044s
user 0m1.932s
sys 0m0.085s
wgl@pondera:~$ time grep LiteonTe *.text | wc -l
11020
real 0m1.939s
user 0m1.905s
sys 0m0.038s
wgl@pondera:~$ time ggrep LiteonTe *.text | wc -l
11020
real 0m0.130s
user 0m0.087s
sys 0m0.037s
wgl@pondera:~$ time ggrep LiteonTe *.text | wc -l
11020
real 0m0.119s
user 0m0.088s
sys 0m0.035s
wgl@pondera:~$ du -h -s *.text
128M Kismetkismet-kali-pondera-20190325-08-46-27-1.pcapdump.text
The first run was done to warm the cache and its number discarded. Thus the 'grep' you see above is the second run over the 128 MB pcap file expanded with tshark.
Dramatic.
I'll stay with the gnu grep and not update the regular ones for now.
It would help if you tested just grep when benchmarking grep. These datapoints tell a much different story.
# /usr/bin/grep -V
grep (BSD grep) 2.6.0-FreeBSD
root@m6600:~ # /usr/local/bin/grep -V
grep (GNU grep) 3.3
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
60
0.54 real 0.45 user 0.07 sys
root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
60
0.54 real 0.44 user 0.08 sys
root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
60
0.54 real 0.41 user 0.11 sys
root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
60
0.58 real 0.49 user 0.08 sys
root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
60
0.60 real 0.48 user 0.11 sys
root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
60
0.59 real 0.50 user 0.08 sys
root@m6600:~ # du -h -s packetdump.pcap
225M packetdump.pcap
That is a very good point. Taking this better approach, here is what I get on my (not updated grep) system:
wgl:$ /usr/bin/grep --version
grep (BSD grep) 2.5.1-FreeBSD
wgl:$ /usr/local/bin/ggrep --version
ggrep (GNU grep) 3.3
Packaged by Homebrew
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text | wc -l
2.30 real 1.04 user 0.67 sys
1228
wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text | wc -l
5.65 real 5.30 user 0.33 sys
1228
wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text >/dev/null
0.05 real 0.03 user 0.01 sys
wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text >/dev/null
6.50 real 5.71 user 0.58 sys
wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text -c
1228
2.33 real 1.05 user 0.69 sys
wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text -c
1228
5.37 real 5.05 user 0.31 sys
The wc -l is clearly polluting the result. However, I suspect that the >/dev/null is as well. But in the worst case, I see a halving of time over the old grep (edited), which correlates with my most common use of grep in looking through source files.
The compiler is also going to have an impact, which for me is consistent across both grep binaries.
FreeBSD clang version 6.0.1 as well as -O2
I suspect there are still edge cases where BSD grep is quite a bit slower than, or not compatible with, GNU grep. However, with a closer apples-to-apples comparison there isn't much difference anymore for my usage, which is a lot of grep use, but pretty vanilla.
There may also be other OS differences in our comparison. My tests were run against a fairly recent FreeBSD 12-STABLE.
Grep is a damn useful tool, I use it every day. But is it just me, or does grep sacrifice convenience for performance? E.g. it would rather skip some bytes than make sure it finds everything. And case-sensitive matching is the default, when most of the time you want to find all case variations.
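Case-insensitivity is at least just a flag away, and ripgrep's smart-case mode is a nice middle ground (a small sketch; the pattern and logs/ directory are made up):

    grep -ri 'connection reset' logs/    # -i ignores case, -r recurses
    rg -S 'connection reset'             # smart case: insensitive here because
                                         # the pattern is all lowercase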
I often have the experience of being like, "Surely I can't just do this entire task with a bunch of greps in a timely manner, it's going to take forever, right?" and then boom, the script takes like a minute to finish sorting through bazillions of lines.
Yes, Boyer-Moore is ridiculously fast, but it's a string search algorithm, not a regular expression search algorithm. You can't use it for regexes. Is TFA still true when you're not searching for literal strings?
Yes, because it will extract literals from your regex and search for those first. When a match of the literal is found, all you then need to do is find the line boundaries of that match and run the regex engine on just that line. It ends up working pretty well.
Of course, if your regex has no required literals at all, then this optimization can't be performed. GNU grep can slow down quite a bit in these cases, especially if you have a non-C locale enabled and use some Unicode aware features such as \w.
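If you want to see that effect on your own data, an unscientific comparison looks something like this (assuming an en_US.UTF-8 locale is available; the pattern is deliberately free of literals so grep has nothing to latch onto):

    time LC_ALL=en_US.UTF-8 grep -cE '[[:alpha:]]{3}[[:digit:]]{4}' application.log
    time LC_ALL=C           grep -cE '[[:alpha:]]{3}[[:digit:]]{4}' application.log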
I keep this on my bookmarks: [1] (that I fished from this other blog post [2]).
Not sure if any actual search tools other than Scalyr use that specific implementation, but it makes for an interesting read anyway if you are into substring search algorithms (the algo described improves on Boyer-Moore).
This would depend upon which version of GNU grep one installs...
About 10 years ago (when this article was written), I upgraded a RedHat system and discovered that some of my scripts were running 10 times slower than they had on the previous install. I isolated the issue to grep and benchmarked the grep from the previous release against the current release. The new grep was 10 times slower. I didn't have much more time to waste on the issue so I simply copied the grep binary from the previous release to the new install and everything was fine afterward.
This sounds a lot like when they started making it handle Unicode correctly. We ran into the same problem but fixed it by setting `LC_ALL=C` in the affected log processing scripts.
Just in case anybody wants to look at the code, I think these two files [0, 1] contain the main logic when it comes to searching. Please correct me if these files aren't the right ones:
Well, the Netflix folks claim the exact opposite :-)
Also, note that there were some quite fundamental optimizations in FreeBSD 12, such as introduction of the “epoch” mechanism - basically RCU. 11 still only uses old-fashioned locks.
This is simply a very good technical mailing list discussion. Instead of the whole BSD vs. GNU thing, you get to-the-point information sharing on processing structures.
I do wonder about the later messages discussing the other string matching algorithms (one of them using a trie to easily backtrack); while they are considered in the discussion, I didn't find any search/regex tool implementing them.
I must've been temporarily blind while scanning the src, thanks! The kinds of implementations that bring together various data structures and algorithms are usually rather interesting to read and learn from.
I think this is originally to avoid any possible plagiarism claims by Unix--if a Unix tool was originally optimized for one of size/efficiency/speed/memory use, the GNU tools were optimized for one of the others. This is part of why the `yes` utility is so freakin fast in GNU, despite being kind of a weird use case for speed.
I don't know if it is or isn't the plagiarism thing but GNU tools were written with a very different, way more modern style than Unix tools.
Unix was pretty much "do a quick loop using terse code" and "see what fits in this static small buffer, and if it doesn't, quietly truncate". GNU would reallocate buffers to fit, and take reasonable precautions on performing well on tiny/huge inputs.
GNU was actually considered bloated by the Unix crowd ("why so many more kilobytes of code?" ). It's an approach that scaled better for the future though.
See section 2.1 of [0] and the top response to this comment [1]. You'll note in the GNU standards there, it encourages optimizing for the opposite of Unix to avoid collisions in strategy.
That’s an interesting term in this context. The technique is so clever and the gains so big that it does feel a “cheat”. But as long as it replicates the behavior and effects and constraints of the previous algorithm, is it really a “cheat” even in the loosest sense of the word? It feels far more apt to view prior work as “tryhard”